<a href="https://github.com/dd-consulting">
     <img src="../reference/GZ_logo.png" width="60" align="right">
</a>
<h1>
    One-Stop Analytics: R
</h1>

Case Study of Autism Spectrum Disorder (ASD) with R


[ United States ]

Centers for Disease Control and Prevention (CDC) - Autism Spectrum Disorder (ASD)

Autism spectrum disorder (ASD) is a developmental disability that can cause significant social, communication and behavioral challenges. CDC is committed to continuing to provide essential data on ASD, search for factors that put children at risk for ASD and possible causes, and develop resources that help identify children with ASD as early as possible.

https://www.cdc.gov/ncbddd/autism/data/index.html

[ Singapore ]

TODAY Online - More preschoolers diagnosed with developmental issues

Doctors cited better awareness among parents and preschool teachers, leading to early referrals for diagnosis.

https://www.gov.sg/news/content/today-online-more-preschoolers-diagnosed-with-developmental-issues

https://www.pathlight.org.sg/


Workshop Objective:

Use R to analyze Autism Spectrum Disorder (ASD) data published by the U.S. CDC.

https://www.cdc.gov/ncbddd/autism/data/index.html

  • R Fundamentals

  • Data Summarization

  • Data Visualisation (Base Graphics)

  • Appendices


R Fundamentals

<h3>
R Fundamentals - Get & Set working directory
</h3>

Obtain current R working directory

getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"

Set new R working directory

# setwd("/media/sf_vm_shared_folder/git/DDC/DDC-ASD/model_R")
# setwd('~/Desktop/admin-desktop/vm_shared_folder/git/DDC-ASD/model_R')
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
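
The paths above are specific to the author's machine. As a minimal sketch, assuming a hypothetical folder created under tempdir() so that it runs anywhere, the working directory can be changed and then restored by keeping the value that setwd() returns:

```r
# Hypothetical project folder, created under tempdir() purely for illustration
project_dir <- file.path(tempdir(), "model_R")
dir.create(project_dir, showWarnings = FALSE)

# setwd() invisibly returns the directory that was current before the change
old_wd <- setwd(project_dir)
wd_name <- basename(getwd())
print(wd_name)

# Restore the original working directory
setwd(old_wd)
```

Saving the previous directory in old_wd avoids hard-coding a path just to get back to where the session started.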

Read in CSV data, storing as R dataframe

# Dataset: U.S. national-level ASD prevalence in children
ASD_National <- read.csv("../dataset/ADV_ASD_National.csv", stringsAsFactors = FALSE)
# Dataset: U.S. state-level ASD prevalence in children
ASD_State    <- read.csv("../dataset/ADV_ASD_State.csv", stringsAsFactors = FALSE)
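
To illustrate what stringsAsFactors controls, here is a small sketch that writes a hypothetical three-line CSV to a temporary file (the two data rows are copied from the national dataset shown in this section) and reads it back both ways. Since R 4.0.0 the default is already FALSE, so passing it explicitly mainly matters on older installations.

```r
# Hypothetical temporary CSV standing in for the workshop's dataset files
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Source,Year,Prevalence",
             "addm,2000,6.7",
             "addm,2002,6.6"), csv_path)

# Text columns stay as character vectors
as_chr <- read.csv(csv_path, stringsAsFactors = FALSE)
# Text columns are converted to factors
as_fct <- read.csv(csv_path, stringsAsFactors = TRUE)

print(class(as_chr$Source))  # "character"
print(class(as_fct$Source))  # "factor"

# Remove the temporary file
unlink(csv_path)
```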

Look at first/last few rows of data

head(ASD_National)
##   Source Year Prevalence Upper.CI Lower.CI Prevalence_dup
## 1   addm 2000        6.7      7.0      6.3            6.7
## 2   addm 2002        6.6      6.8      6.3            6.6
## 3   addm 2004        8.0      8.4      7.6            8.0
## 4   addm 2006        9.0      9.3      8.6            9.0
## 5   addm 2008       11.3     11.7     11.0           11.3
## 6   addm 2010       14.7     15.1     14.3           14.7
##                                             Source_Full1
## 1 Autism & Developmental Disabilities Monitoring Network
## 2 Autism & Developmental Disabilities Monitoring Network
## 3 Autism & Developmental Disabilities Monitoring Network
## 4 Autism & Developmental Disabilities Monitoring Network
## 5 Autism & Developmental Disabilities Monitoring Network
## 6 Autism & Developmental Disabilities Monitoring Network
##                                                  Source_Full2 Male.Prevalence
## 1 addm-Autism & Developmental Disabilities Monitoring Network         No data
## 2 addm-Autism & Developmental Disabilities Monitoring Network            11.5
## 3 addm-Autism & Developmental Disabilities Monitoring Network            12.9
## 4 addm-Autism & Developmental Disabilities Monitoring Network            14.5
## 5 addm-Autism & Developmental Disabilities Monitoring Network            18.4
## 6 addm-Autism & Developmental Disabilities Monitoring Network            23.7
##   Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1       No data       No data           No data         No data         No data
## 2       No data       No data               2.7         No data         No data
## 3          12.2          13.7               2.9             2.6             3.3
## 4          13.9          15.1               3.2             2.9             3.5
## 5          17.7            19                 4             3.7             4.3
## 6            23          24.4               5.3               5             5.7
##   Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1                       No data                     No data
## 2                           7.7                     No data
## 3                           9.7                         9.1
## 4                           9.9                         9.4
## 5                            12                        11.5
## 6                          15.8                        15.2
##   Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1                     No data                       No data
## 2                     No data                           6.5
## 3                        10.4                           6.9
## 4                        10.4                           7.2
## 5                        12.5                          10.2
## 6                        16.3                          12.3
##   Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1                     No data                     No data             No data
## 2                     No data                     No data             No data
## 3                         6.2                         7.6                 6.2
## 4                         6.6                         7.8                 5.9
## 5                         9.5                        10.9                 7.9
## 6                        11.5                        13.1                10.8
##   Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1           No data           No data                              No data
## 2           No data           No data                              No data
## 3                 5               7.5                              No data
## 4               5.3               6.6                              No data
## 5               7.2               8.6                                  9.7
## 6                10              11.6                                 12.3
##   Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1                            No data                            No data
## 2                            No data                            No data
## 3                            No data                            No data
## 4                            No data                            No data
## 5                                8.1                               11.6
## 6                               10.7                               14.2
tail(ASD_State)
##      State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1687    UT      596257        8.7      8.5      9.0 2016   sped
## 1688    VT       74108       12.1     11.3     12.9 2016   sped
## 1689    VA     1162945       14.2     14.0     14.4 2016   sped
## 1690    WA     1006676       11.2     11.0     11.4 2016   sped
## 1691    WV      239037        8.6      8.3      9.0 2016   sped
## 1692    WY       85922        9.3      8.7     10.0 2016   sped
##                       Source_Full1   State_Full1      State_Full2 Numerator_ASD
## 1687 Special Education Child Count          Utah          UT-Utah          5187
## 1688 Special Education Child Count       Vermont       VT-Vermont           897
## 1689 Special Education Child Count      Virginia      VA-Virginia         16514
## 1690 Special Education Child Count    Washington    WA-Washington         11275
## 1691 Special Education Child Count West Virginia WV-West Virginia          2056
## 1692 Special Education Child Count       Wyoming       WY-Wyoming           799
##      Numerator_NonASD  Proportion    X95_Z_CI Z_Lower.CI Z_Upper.CI
## 1687           591070 0.008699269 0.000235709   8.463560   8.934978
## 1688            73211 0.012103956 0.000787290  11.316666  12.891247
## 1689          1146431 0.014200156 0.000215035  13.985121  14.415191
## 1690           995401 0.011200227 0.000205575  10.994652  11.405803
## 1691           236981 0.008601179 0.000370185   8.230994   8.971364
## 1692            85123 0.009299132 0.000641783   8.657349   9.940915
##      Z_Lower.CI_ABSerror Z_Upper.CI_ABSerror Chi_Wilson_P X95_Chi_Wilson_CI
## 1687         0.036439666         0.065022456  0.008702434       0.000235729
## 1688         0.016666193         0.008753417  0.012129246       0.000787676
## 1689         0.014879497         0.015190775  0.014201760       0.000215041
## 1690         0.005347969         0.005802534  0.011202093       0.000205583
## 1691         0.069006017         0.028636189  0.008609076       0.000370266
## 1692         0.042651475         0.059084984  0.009321069       0.000642144
##      Chi_Wilson_Lower.CI Chi_Wilson_Upper.CI Chi_Wilson_Lower.CI_ABSerror
## 1687            8.466705            8.938163                  0.033294913
## 1688           11.341570           12.916922                  0.041569768
## 1689           13.986720           14.416801                  0.013280432
## 1690           10.996509           11.407676                  0.003490794
## 1691            8.238810            8.979342                  0.061190335
## 1692            8.678926            9.963213                  0.021074361
##      Chi_Wilson_Upper.CI_ABSerror Chi_Wilson_Corrected_w_minus.CI
## 1687                  0.061836719                     0.008465878
## 1688                  0.016921499                     0.011335040
## 1689                  0.016801104                     0.013986293
## 1690                  0.007675848                     0.010996017
## 1691                  0.020658015                     0.008236763
## 1692                  0.036786888                     0.008673305
##      Chi_Wilson_Corrected_w_plus.CI Chi_Wilson_Corrected_Lower.CI
## 1687                    0.008939013                      8.465878
## 1688                    0.012923885                     11.335040
## 1689                    0.014417234                     13.986293
## 1690                    0.011408177                     10.996017
## 1691                    0.008981478                      8.236763
## 1692                    0.009969231                      8.673305
##      Chi_Wilson_Corrected_Upper.CI Chi_Wilson_Corrected_Lower.CI_ABSerror
## 1687                      8.939013                             0.03412221
## 1688                     12.923885                             0.03503985
## 1689                     14.417234                             0.01370717
## 1690                     11.408177                             0.00398297
## 1691                      8.981478                             0.06323741
## 1692                      9.969231                             0.02669451
##      Chi_Wilson_Corrected_Upper.CI_ABSerror Male.Prevalence Male.Lower.CI
## 1687                            0.060986900              NA            NA
## 1688                            0.023884634              NA            NA
## 1689                            0.017234254              NA            NA
## 1690                            0.008177037              NA            NA
## 1691                            0.018521714              NA            NA
## 1692                            0.030769154              NA            NA
##      Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1687            NA                NA              NA              NA
## 1688            NA                NA              NA              NA
## 1689            NA                NA              NA              NA
## 1690            NA                NA              NA              NA
## 1691            NA                NA              NA              NA
## 1692            NA                NA              NA              NA
##      Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1687                            NA                          NA
## 1688                            NA                          NA
## 1689                            NA                          NA
## 1690                            NA                          NA
## 1691                            NA                          NA
## 1692                            NA                          NA
##      Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1687                          NA                              
## 1688                          NA                              
## 1689                          NA                              
## 1690                          NA                              
## 1691                          NA                              
## 1692                          NA                              
##      Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI
## 1687                                                        
## 1688                                                        
## 1689                                                        
## 1690                                                        
## 1691                                                        
## 1692                                                        
##      Hispanic.Prevalence Hispanic.Lower.CI Hispanic.Upper.CI
## 1687                                                        
## 1688                                                        
## 1689                                                        
## 1690                                                        
## 1691                                                        
## 1692                                                        
##      Asian.or.Pacific.Islander.Prevalence Asian.or.Pacific.Islander.Lower.CI
## 1687                                                                        
## 1688                                                                        
## 1689                                                                        
## 1690                                                                        
## 1691                                                                        
## 1692                                                                        
##      Asian.or.Pacific.Islander.Upper.CI      State_Region
## 1687                                          D8 Mountain
## 1688                                       D1 New England
## 1689                                    D5 South Atlantic
## 1690                                           D9 Pacific
## 1691                                    D5 South Atlantic
## 1692                                          D8 Mountain
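
Both head() and tail() return six rows by default, which is wide output for dataframes with this many columns. A minimal sketch of the n argument, using a toy dataframe (toy is a hypothetical name; its values are the first six Year/Prevalence pairs printed above):

```r
# Toy dataframe built from the first six national rows shown above
toy <- data.frame(Year       = c(2000, 2002, 2004, 2006, 2008, 2010),
                  Prevalence = c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7))

first_two    <- head(toy, n = 2)   # first two rows
last_two     <- tail(toy, n = 2)   # last two rows
all_but_last <- head(toy, n = -1)  # negative n drops rows from the end

print(first_two)
print(last_two)
print(nrow(all_but_last))  # 5
```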

Obtain number of rows and number of columns/features/variables

dim(ASD_National)
## [1] 42 26
dim(ASD_State)
## [1] 1692   49
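
dim() returns the row and column counts together; nrow() and ncol() return them individually. A short sketch on a toy dataframe (hypothetical name toy, values taken from three of the state rows shown earlier):

```r
# Toy dataframe with three rows and two columns
toy <- data.frame(State      = c("UT", "VT", "VA"),
                  Prevalence = c(8.7, 12.1, 14.2))

print(dim(toy))   # 3 2
print(nrow(toy))  # 3
print(ncol(toy))  # 2

# dim() is simply c(nrow, ncol)
same <- identical(dim(toy), c(nrow(toy), ncol(toy)))
print(same)  # TRUE
```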

Obtain overview (data structure/types)

str(ASD_National)
## 'data.frame':    42 obs. of  26 variables:
##  $ Source                              : chr  "addm" "addm" "addm" "addm" ...
##  $ Year                                : int  2000 2002 2004 2006 2008 2010 2012 2014 2004 2008 ...
##  $ Prevalence                          : num  6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
##  $ Upper.CI                            : num  7 6.8 8.4 9.3 11.7 15.1 15.2 17.3 12 18.1 ...
##  $ Lower.CI                            : num  6.3 6.3 7.6 8.6 11 14.3 14.4 16.4 7.4 14.5 ...
##  $ Prevalence_dup                      : num  6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
##  $ Source_Full1                        : chr  "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
##  $ Source_Full2                        : chr  "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" ...
##  $ Male.Prevalence                     : chr  "No data" "11.5" "12.9" "14.5" ...
##  $ Male.Lower.CI                       : chr  "No data" "No data" "12.2" "13.9" ...
##  $ Male.Upper.CI                       : chr  "No data" "No data" "13.7" "15.1" ...
##  $ Female.Prevalence                   : chr  "No data" "2.7" "2.9" "3.2" ...
##  $ Female.Lower.CI                     : chr  "No data" "No data" "2.6" "2.9" ...
##  $ Female.Upper.CI                     : chr  "No data" "No data" "3.3" "3.5" ...
##  $ Non.hispanic.white.Prevalence       : chr  "No data" "7.7" "9.7" "9.9" ...
##  $ Non.hispanic.white.Lower.CI         : chr  "No data" "No data" "9.1" "9.4" ...
##  $ Non.hispanic.white.Upper.CI         : chr  "No data" "No data" "10.4" "10.4" ...
##  $ Non.hispanic.black.Prevalence       : chr  "No data" "6.5" "6.9" "7.2" ...
##  $ Non.hispanic.black.Lower.CI         : chr  "No data" "No data" "6.2" "6.6" ...
##  $ Non.hispanic.black.Upper.CI         : chr  "No data" "No data" "7.6" "7.8" ...
##  $ Hispanic.Prevalence                 : chr  "No data" "No data" "6.2" "5.9" ...
##  $ Hispanic.Lower.CI                   : chr  "No data" "No data" "5" "5.3" ...
##  $ Hispanic.Upper.CI                   : chr  "No data" "No data" "7.5" "6.6" ...
##  $ Asian.or.Pacific.Islander.Prevalence: chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Lower.CI  : chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Upper.CI  : chr  "No data" "No data" "No data" "No data" ...
str(ASD_State)
## 'data.frame':    1692 obs. of  49 variables:
##  $ State                                 : chr  "AZ" "GA" "MD" "NJ" ...
##  $ Denominator                           : int  45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
##  $ Prevalence                            : num  6.5 6.5 5.5 9.9 6.3 4.5 3.3 6.2 6.9 5.9 ...
##  $ Lower.CI                              : num  5.8 5.8 4.6 8.9 5.4 3.7 2.7 5.5 6.1 4.6 ...
##  $ Upper.CI                              : num  7.3 7.3 6.6 11.1 7.4 5.5 3.9 7 7.8 7.5 ...
##  $ Year                                  : int  2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
##  $ Source                                : chr  "addm" "addm" "addm" "addm" ...
##  $ Source_Full1                          : chr  "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
##  $ State_Full1                           : chr  "Arizona" "Georgia" "Maryland" "New Jersey" ...
##  $ State_Full2                           : chr  "AZ-Arizona" "GA-Georgia" "MD-Maryland" "NJ-New Jersey" ...
##  $ Numerator_ASD                         : int  295 283 118 294 155 104 117 280 252 65 ...
##  $ Numerator_NonASD                      : int  45027 43310 21414 29420 24380 22961 35355 44833 36220 10955 ...
##  $ Proportion                            : num  0.00651 0.00649 0.00548 0.00989 0.00632 ...
##  $ X95_Z_CI                              : num  0.00074 0.000754 0.000986 0.001125 0.000991 ...
##  $ Z_Lower.CI                            : num  5.77 5.74 4.49 8.77 5.33 ...
##  $ Z_Upper.CI                            : num  7.25 7.25 6.47 11.02 7.31 ...
##  $ Z_Lower.CI_ABSerror                   : num  0.0314 0.062 0.1059 0.1311 0.0739 ...
##  $ Z_Upper.CI_ABSerror                   : num  0.0507 0.0542 0.1337 0.0803 0.0911 ...
##  $ Chi_Wilson_P                          : num  0.00655 0.00654 0.00557 0.00996 0.00639 ...
##  $ X95_Chi_Wilson_CI                     : num  0.000741 0.000755 0.00099 0.001127 0.000994 ...
##  $ Chi_Wilson_Lower.CI                   : num  5.81 5.78 4.58 8.83 5.4 ...
##  $ Chi_Wilson_Upper.CI                   : num  7.29 7.29 6.56 11.08 7.39 ...
##  $ Chi_Wilson_Lower.CI_ABSerror          : num  0.009314 0.019761 0.021503 0.069416 0.000453 ...
##  $ Chi_Wilson_Upper.CI_ABSerror          : num  0.0077 0.00953 0.04165 0.01523 0.01087 ...
##  $ Chi_Wilson_Corrected_w_minus.CI       : num  0.0058 0.00577 0.00456 0.00881 0.00538 ...
##  $ Chi_Wilson_Corrected_w_plus.CI        : num  0.0073 0.0073 0.00658 0.0111 0.00741 ...
##  $ Chi_Wilson_Corrected_Lower.CI         : num  5.8 5.77 4.56 8.81 5.38 ...
##  $ Chi_Wilson_Corrected_Upper.CI         : num  7.3 7.3 6.58 11.1 7.41 ...
##  $ Chi_Wilson_Corrected_Lower.CI_ABSerror: num  0.00109 0.03057 0.04265 0.08529 0.01834 ...
##  $ Chi_Wilson_Corrected_Upper.CI_ABSerror: num  0.00395 0.0026 0.01636 0.00254 0.01108 ...
##  $ Male.Prevalence                       : num  9.7 11 8.6 14.8 9.3 6.6 5 10.1 10.7 9.9 ...
##  $ Male.Lower.CI                         : num  8.5 9.7 7.1 13 7.8 5.2 4.1 8.8 9.3 7.6 ...
##  $ Male.Upper.CI                         : num  11.1 12.4 10.6 16.8 11.2 8.2 6.2 11.4 12.3 12.9 ...
##  $ Female.Prevalence                     : num  3.2 2 2.2 4.3 3.3 2.4 1.4 2.2 2.9 1.7 ...
##  $ Female.Lower.CI                       : num  2.5 1.5 1.5 3.3 2.4 1.6 0.9 1.7 2.2 0.9 ...
##  $ Female.Upper.CI                       : num  4 2.7 2.7 5.5 4.5 3.5 2.1 2.9 3.8 3.2 ...
##  $ Non.hispanic.white.Prevalence         : num  8.6 7.9 4.9 11.3 6.5 4.5 3.3 7.7 7.4 6.4 ...
##  $ Non.hispanic.white.Lower.CI           : num  7.5 6.7 3.8 9.5 5.2 3.7 2.6 6.7 6.5 4.8 ...
##  $ Non.hispanic.white.Upper.CI           : num  9.8 9.3 6.4 13.3 8.2 5.5 4.1 8.9 8.6 8.5 ...
##  $ Non.hispanic.black.Prevalence         : chr  "7.3" "5.3" "6.1" "10.6" ...
##  $ Non.hispanic.black.Lower.CI           : chr  "4.4" "4.4" "4.7" "8.5" ...
##  $ Non.hispanic.black.Upper.CI           : chr  "12.2" "6.4" "8" "13.1" ...
##  $ Hispanic.Prevalence                   : chr  "No data" "No data" "No data" "No data" ...
##  $ Hispanic.Lower.CI                     : chr  "No data" "No data" "No data" "No data" ...
##  $ Hispanic.Upper.CI                     : chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Prevalence  : chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Lower.CI    : chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Upper.CI    : chr  "No data" "No data" "No data" "No data" ...
##  $ State_Region                          : chr  "D8 Mountain" "D5 South Atlantic" "D5 South Atlantic" "D2 Middle Atlantic" ...
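
In the str() output above, many prevalence columns are typed chr rather than num because missing values are coded as the text "No data". One possible way to handle this at import time, sketched here on a hypothetical inline CSV rather than the workshop files, is the na.strings argument of read.csv():

```r
# Hypothetical inline CSV that mimics the "No data" coding seen above
csv_text <- "Year,Male.Prevalence\n2000,No data\n2002,11.5\n2004,12.9"

# Default import: the text marker forces the whole column to character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Declaring the marker as missing lets R parse the column as numeric
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = "No data")

print(class(raw$Male.Prevalence))    # "character"
print(class(clean$Male.Prevalence))  # "numeric"
print(clean$Male.Prevalence)         # NA 11.5 12.9
```

Whether the real files need additional markers (the state-level output also shows blank cells) is an assumption to check against the data; na.strings accepts a vector such as c("No data", "").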

Obtain column names

names(ASD_National)
##  [1] "Source"                              
##  [2] "Year"                                
##  [3] "Prevalence"                          
##  [4] "Upper.CI"                            
##  [5] "Lower.CI"                            
##  [6] "Prevalence_dup"                      
##  [7] "Source_Full1"                        
##  [8] "Source_Full2"                        
##  [9] "Male.Prevalence"                     
## [10] "Male.Lower.CI"                       
## [11] "Male.Upper.CI"                       
## [12] "Female.Prevalence"                   
## [13] "Female.Lower.CI"                     
## [14] "Female.Upper.CI"                     
## [15] "Non.hispanic.white.Prevalence"       
## [16] "Non.hispanic.white.Lower.CI"         
## [17] "Non.hispanic.white.Upper.CI"         
## [18] "Non.hispanic.black.Prevalence"       
## [19] "Non.hispanic.black.Lower.CI"         
## [20] "Non.hispanic.black.Upper.CI"         
## [21] "Hispanic.Prevalence"                 
## [22] "Hispanic.Lower.CI"                   
## [23] "Hispanic.Upper.CI"                   
## [24] "Asian.or.Pacific.Islander.Prevalence"
## [25] "Asian.or.Pacific.Islander.Lower.CI"  
## [26] "Asian.or.Pacific.Islander.Upper.CI"
names(ASD_State)
##  [1] "State"                                 
##  [2] "Denominator"                           
##  [3] "Prevalence"                            
##  [4] "Lower.CI"                              
##  [5] "Upper.CI"                              
##  [6] "Year"                                  
##  [7] "Source"                                
##  [8] "Source_Full1"                          
##  [9] "State_Full1"                           
## [10] "State_Full2"                           
## [11] "Numerator_ASD"                         
## [12] "Numerator_NonASD"                      
## [13] "Proportion"                            
## [14] "X95_Z_CI"                              
## [15] "Z_Lower.CI"                            
## [16] "Z_Upper.CI"                            
## [17] "Z_Lower.CI_ABSerror"                   
## [18] "Z_Upper.CI_ABSerror"                   
## [19] "Chi_Wilson_P"                          
## [20] "X95_Chi_Wilson_CI"                     
## [21] "Chi_Wilson_Lower.CI"                   
## [22] "Chi_Wilson_Upper.CI"                   
## [23] "Chi_Wilson_Lower.CI_ABSerror"          
## [24] "Chi_Wilson_Upper.CI_ABSerror"          
## [25] "Chi_Wilson_Corrected_w_minus.CI"       
## [26] "Chi_Wilson_Corrected_w_plus.CI"        
## [27] "Chi_Wilson_Corrected_Lower.CI"         
## [28] "Chi_Wilson_Corrected_Upper.CI"         
## [29] "Chi_Wilson_Corrected_Lower.CI_ABSerror"
## [30] "Chi_Wilson_Corrected_Upper.CI_ABSerror"
## [31] "Male.Prevalence"                       
## [32] "Male.Lower.CI"                         
## [33] "Male.Upper.CI"                         
## [34] "Female.Prevalence"                     
## [35] "Female.Lower.CI"                       
## [36] "Female.Upper.CI"                       
## [37] "Non.hispanic.white.Prevalence"         
## [38] "Non.hispanic.white.Lower.CI"           
## [39] "Non.hispanic.white.Upper.CI"           
## [40] "Non.hispanic.black.Prevalence"         
## [41] "Non.hispanic.black.Lower.CI"           
## [42] "Non.hispanic.black.Upper.CI"           
## [43] "Hispanic.Prevalence"                   
## [44] "Hispanic.Lower.CI"                     
## [45] "Hispanic.Upper.CI"                     
## [46] "Asian.or.Pacific.Islander.Prevalence"  
## [47] "Asian.or.Pacific.Islander.Lower.CI"    
## [48] "Asian.or.Pacific.Islander.Upper.CI"    
## [49] "State_Region"

Display each column name with its index number

cbind(names(ASD_National), seq_along(names(ASD_National)))
##       [,1]                                   [,2]
##  [1,] "Source"                               "1" 
##  [2,] "Year"                                 "2" 
##  [3,] "Prevalence"                           "3" 
##  [4,] "Upper.CI"                             "4" 
##  [5,] "Lower.CI"                             "5" 
##  [6,] "Prevalence_dup"                       "6" 
##  [7,] "Source_Full1"                         "7" 
##  [8,] "Source_Full2"                         "8" 
##  [9,] "Male.Prevalence"                      "9" 
## [10,] "Male.Lower.CI"                        "10"
## [11,] "Male.Upper.CI"                        "11"
## [12,] "Female.Prevalence"                    "12"
## [13,] "Female.Lower.CI"                      "13"
## [14,] "Female.Upper.CI"                      "14"
## [15,] "Non.hispanic.white.Prevalence"        "15"
## [16,] "Non.hispanic.white.Lower.CI"          "16"
## [17,] "Non.hispanic.white.Upper.CI"          "17"
## [18,] "Non.hispanic.black.Prevalence"        "18"
## [19,] "Non.hispanic.black.Lower.CI"          "19"
## [20,] "Non.hispanic.black.Upper.CI"          "20"
## [21,] "Hispanic.Prevalence"                  "21"
## [22,] "Hispanic.Lower.CI"                    "22"
## [23,] "Hispanic.Upper.CI"                    "23"
## [24,] "Asian.or.Pacific.Islander.Prevalence" "24"
## [25,] "Asian.or.Pacific.Islander.Lower.CI"   "25"
## [26,] "Asian.or.Pacific.Islander.Upper.CI"   "26"
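
Because cbind() builds a matrix, the index numbers above are coerced to character strings. As an alternative sketch (toy and col_index are hypothetical names), a dataframe keeps the index numeric, and match() or which() looks up the position of a given column name:

```r
# Toy one-row dataframe with three of the national column names
toy <- data.frame(Source = "addm", Year = 2000L, Prevalence = 6.7,
                  stringsAsFactors = FALSE)

# Index stays integer, name stays character
col_index <- data.frame(index = seq_along(names(toy)),
                        name  = names(toy),
                        stringsAsFactors = FALSE)
print(col_index)

# Look up the position of a column by its name
prev_pos <- match("Prevalence", names(toy))
year_pos <- which(names(toy) == "Year")
print(prev_pos)  # 3
print(year_pos)  # 2
```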

Look at data structure/schema (Selected columns)

str(ASD_National[, c(1:8, 24:26)])
## 'data.frame':    42 obs. of  11 variables:
##  $ Source                              : chr  "addm" "addm" "addm" "addm" ...
##  $ Year                                : int  2000 2002 2004 2006 2008 2010 2012 2014 2004 2008 ...
##  $ Prevalence                          : num  6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
##  $ Upper.CI                            : num  7 6.8 8.4 9.3 11.7 15.1 15.2 17.3 12 18.1 ...
##  $ Lower.CI                            : num  6.3 6.3 7.6 8.6 11 14.3 14.4 16.4 7.4 14.5 ...
##  $ Prevalence_dup                      : num  6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
##  $ Source_Full1                        : chr  "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
##  $ Source_Full2                        : chr  "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" ...
##  $ Asian.or.Pacific.Islander.Prevalence: chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Lower.CI  : chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Upper.CI  : chr  "No data" "No data" "No data" "No data" ...
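
Numeric positions such as 24 to 26 break silently if the column order changes. A hedged sketch of selecting the same kind of subset by name instead, using grep() on a toy dataframe (hypothetical names toy, asian_cols and subset_df; the column names are copied from the output above):

```r
# Toy one-row dataframe with a few of the national column names
toy <- data.frame(Source = "addm",
                  Year = 2000L,
                  Prevalence = 6.7,
                  Asian.or.Pacific.Islander.Prevalence = "No data",
                  Asian.or.Pacific.Islander.Lower.CI   = "No data",
                  stringsAsFactors = FALSE)

# Names of all columns that start with "Asian"
asian_cols <- grep("^Asian", names(toy), value = TRUE)

# Select columns by name rather than by position
subset_df <- toy[, c("Source", "Year", asian_cols)]
str(subset_df)
```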
<h3>
    Quiz:
</h3>
<p>
    Obtain feature/column names and column index of dataframe: ASD_State
</p>
# Write your code below and press Shift+Enter to execute 


<h3>
R Fundamentals - Work with dataframe
</h3>

Access column 1 as a one-column dataframe (typeof() reports "list" because a dataframe is a named list of columns):

# use column index:
ASD_National[1]
##    Source
## 1    addm
## 2    addm
## 3    addm
## 4    addm
## 5    addm
## 6    addm
## 7    addm
## 8    addm
## 9    nsch
## 10   nsch
## 11   nsch
## 12   nsch
## 13   sped
## 14   sped
## 15   sped
## 16   sped
## 17   sped
## 18   sped
## 19   sped
## 20   sped
## 21   sped
## 22   sped
## 23   sped
## 24   sped
## 25   sped
## 26   sped
## 27   sped
## 28   sped
## 29   sped
## 30   medi
## 31   medi
## 32   medi
## 33   medi
## 34   medi
## 35   medi
## 36   medi
## 37   medi
## 38   medi
## 39   medi
## 40   medi
## 41   medi
## 42   medi
typeof(ASD_National[1])
## [1] "list"
ASD_National[1]$Source
##  [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
typeof(ASD_National[1]$Source)
## [1] "character"
# use column name:
ASD_National["Source"]
##    Source
## 1    addm
## 2    addm
## 3    addm
## 4    addm
## 5    addm
## 6    addm
## 7    addm
## 8    addm
## 9    nsch
## 10   nsch
## 11   nsch
## 12   nsch
## 13   sped
## 14   sped
## 15   sped
## 16   sped
## 17   sped
## 18   sped
## 19   sped
## 20   sped
## 21   sped
## 22   sped
## 23   sped
## 24   sped
## 25   sped
## 26   sped
## 27   sped
## 28   sped
## 29   sped
## 30   medi
## 31   medi
## 32   medi
## 33   medi
## 34   medi
## 35   medi
## 36   medi
## 37   medi
## 38   medi
## 39   medi
## 40   medi
## 41   medi
## 42   medi
ASD_National['Source']$Source
##  [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"

Access column 1 as a character (chr) vector:

ASD_National[, 1]
##  [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
# or
ASD_National[, "Source"]
##  [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
# or
ASD_National$Source
##  [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
typeof(ASD_National$Source)
## [1] "character"
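The access styles above differ in what they return. As a minimal sketch on a small made-up data frame (`toy` and its values are illustrative, not part of the ASD dataset), single brackets keep the data frame wrapper, while `[[ ]]`, `[, j]` and `$` return the underlying vector:

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Source = c("addm", "nsch", "sped"),
                  Year   = c(2000L, 2004L, 2007L),
                  stringsAsFactors = FALSE)

class(toy[1])           # single bracket keeps a one-column data.frame
class(toy[["Source"]])  # double bracket extracts the character vector
class(toy[, "Source"])  # matrix-style indexing drops to a vector by default
class(toy$Source)       # $ also extracts the vector

# drop = FALSE keeps the data.frame structure with matrix-style indexing
class(toy[, "Source", drop = FALSE])
```

This is why `typeof(ASD_National[1])` reports "list" (a data frame is a list of columns) while `typeof(ASD_National$Source)` reports "character".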

Count the number of elements in an object:

length(ASD_National) # number of features/columns
## [1] 26
length(ASD_National[1, ]) # number of elements(columns) in row 1
## [1] 26
length(ASD_National[, 1]) # number of elements(rows) in column 1
## [1] 42
length(ASD_National[, "Source"]) # same as above
## [1] 42
length(ASD_National$Source) # number of elements in the chr vector
## [1] 42
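Because a data frame is stored as a list of columns, `length()` reports the column count. As a sketch on a made-up data frame (`toy` is illustrative), `nrow()`, `ncol()` and `dim()` state the intent more directly:

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Source = c("addm", "nsch", "sped"),
                  Year   = c(2000L, 2004L, 2007L))

length(toy)  # number of columns, same as ncol(toy)
ncol(toy)    # number of columns
nrow(toy)    # number of rows
dim(toy)     # rows and columns together
```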

Access elements by first selecting a one-column dataframe, then subsetting its rows:

# using column index
ASD_National[1][1, ]
## [1] "addm"
ASD_National[1][11, ]
## [1] "nsch"
ASD_National[1][11:20, ]
##  [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
# using column name
ASD_National["Source"][1, ]
## [1] "addm"
ASD_National["Source"][11, ]
## [1] "nsch"
ASD_National["Source"][11:20, ]
##  [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"

Access elements by first extracting the column as a vector, then indexing the vector:

# using column index
ASD_National[, 1][1]
## [1] "addm"
ASD_National[, 1][11]
## [1] "nsch"
ASD_National[, 1][11:20]
##  [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
# using column name
ASD_National[, "Source"][1]
## [1] "addm"
ASD_National[, "Source"][11]
## [1] "nsch"
ASD_National[, "Source"][11:20]
##  [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
# using $ operator
ASD_National$Source[1]
## [1] "addm"
ASD_National$Source[11]
## [1] "nsch"
ASD_National$Source[11:20]
##  [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"

Access elements of different columns:

cbind(names(ASD_National), c(1:length(names(ASD_National))))
##       [,1]                                   [,2]
##  [1,] "Source"                               "1" 
##  [2,] "Year"                                 "2" 
##  [3,] "Prevalence"                           "3" 
##  [4,] "Upper.CI"                             "4" 
##  [5,] "Lower.CI"                             "5" 
##  [6,] "Prevalence_dup"                       "6" 
##  [7,] "Source_Full1"                         "7" 
##  [8,] "Source_Full2"                         "8" 
##  [9,] "Male.Prevalence"                      "9" 
## [10,] "Male.Lower.CI"                        "10"
## [11,] "Male.Upper.CI"                        "11"
## [12,] "Female.Prevalence"                    "12"
## [13,] "Female.Lower.CI"                      "13"
## [14,] "Female.Upper.CI"                      "14"
## [15,] "Non.hispanic.white.Prevalence"        "15"
## [16,] "Non.hispanic.white.Lower.CI"          "16"
## [17,] "Non.hispanic.white.Upper.CI"          "17"
## [18,] "Non.hispanic.black.Prevalence"        "18"
## [19,] "Non.hispanic.black.Lower.CI"          "19"
## [20,] "Non.hispanic.black.Upper.CI"          "20"
## [21,] "Hispanic.Prevalence"                  "21"
## [22,] "Hispanic.Lower.CI"                    "22"
## [23,] "Hispanic.Upper.CI"                    "23"
## [24,] "Asian.or.Pacific.Islander.Prevalence" "24"
## [25,] "Asian.or.Pacific.Islander.Lower.CI"   "25"
## [26,] "Asian.or.Pacific.Islander.Upper.CI"   "26"
ASD_National[1, 1] # row 1, column 1: "Source" 
## [1] "addm"
ASD_National[10, 1] # row 10, column 1: "Source"
## [1] "nsch"
ASD_National[1, 3] # row 1, column 3: "Prevalence"
## [1] 6.7
ASD_National[10, 3] # row 10, column 3: "Prevalence"
## [1] 16.2
ASD_National[1:10, 1:3] # row 1 to 10 from column 1 to 3
##    Source Year Prevalence
## 1    addm 2000        6.7
## 2    addm 2002        6.6
## 3    addm 2004        8.0
## 4    addm 2006        9.0
## 5    addm 2008       11.3
## 6    addm 2010       14.7
## 7    addm 2012       14.8
## 8    addm 2014       16.8
## 9    nsch 2004        9.5
## 10   nsch 2008       16.2
# or using columns names
ASD_National[1:10, c('Source', 'Year', 'Prevalence')]
##    Source Year Prevalence
## 1    addm 2000        6.7
## 2    addm 2002        6.6
## 3    addm 2004        8.0
## 4    addm 2006        9.0
## 5    addm 2008       11.3
## 6    addm 2010       14.7
## 7    addm 2012       14.8
## 8    addm 2014       16.8
## 9    nsch 2004        9.5
## 10   nsch 2008       16.2
ASD_National[c(1:10, 20, 30:35), c(1:3, 9, 12)] # rows 1 to 10, 20, and 30 to 35; columns 1 to 3, 9, and 12
##    Source Year Prevalence Male.Prevalence Female.Prevalence
## 1    addm 2000        6.7         No data           No data
## 2    addm 2002        6.6            11.5               2.7
## 3    addm 2004        8.0            12.9               2.9
## 4    addm 2006        9.0            14.5               3.2
## 5    addm 2008       11.3            18.4                 4
## 6    addm 2010       14.7            23.7               5.3
## 7    addm 2012       14.8            23.4               5.2
## 8    addm 2014       16.8            26.6               6.6
## 9    nsch 2004        9.5                                  
## 10   nsch 2008       16.2                                  
## 20   sped 2007        5.4                                  
## 30   medi 2000        2.3                                  
## 31   medi 2001        2.6                                  
## 32   medi 2002        2.8                                  
## 33   medi 2003        3.0                                  
## 34   medi 2004        3.5                                  
## 35   medi 2005        3.9

[ Tips ] We notice missing data from above.

<h3>
R Fundamentals - Process missing data
</h3>

Count missing values in dataframe:

sum(is.na(ASD_National)) # No missing data recognised by R (NA)
## [1] 0
sum(is.na(ASD_State)) # Some missing data recognised by R (NA)
## [1] 14454

R does not treat empty strings or text such as “No data” as missing values, so we need to convert them to NA ourselves.

# Define several offending strings
na_strings <- c("", "No data", "NA", "N A", "N / A", "N/A", "N/ A", "Not Available", "NOt available")
# Load required function from packages:
if(!require(naniar)){install.packages("naniar")}
## Loading required package: naniar
library(naniar)
if(!require(dplyr)){install.packages("dplyr")}
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dplyr)
# Uncomment below to show help
# ?replace_with_na_all # Documentation

Replace these defined missing/offending values with R’s internal NA:

# "~.x %in% na_strings" is a one-sided formula (lambda); ".x" stands for the values being tested:
ASD_National = replace_with_na_all(ASD_National, condition = ~.x %in% na_strings) 
# Count missing values (R's internal NA) in dataframe:
sum(is.na(ASD_National))
## [1] 650
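For comparison, the same replacement can be sketched in base R without `naniar`. The data frame `toy` and the shortened `na_strings_demo` vector are made up for illustration; results on the real data would match only if the same strings are listed:

```r
# Hypothetical toy data with placeholder strings standing in for missing values
toy <- data.frame(Prevalence      = c(6.7, 6.6, 9.5),
                  Male.Prevalence = c("No data", "11.5", ""),
                  stringsAsFactors = FALSE)
na_strings_demo <- c("", "No data", "N/A")

sum(is.na(toy))  # 0: R does not treat these strings as missing

# Replace every listed string with NA, column by column;
# toy[] keeps the data.frame structure
toy[] <- lapply(toy, function(col) {
  col[col %in% na_strings_demo] <- NA
  col
})

colSums(is.na(toy))  # missing values per column
```

Alternatively, `read.csv()` accepts an `na.strings` argument, so such strings can be converted to NA when the file is loaded.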
<h3>
R Fundamentals - Process invalid characters
</h3>

Replace the invalid character \x92 (byte 0x92, the right single quotation mark in Windows-1252) with a plain apostrophe:
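The cleanup in this step matches whole strings exactly. A more general sketch, assuming the stray byte is `\x92` and using a made-up vector `x`, replaces the byte wherever it occurs:

```r
# Hypothetical input containing the raw byte 0x92
x <- c("National Survey of Children\x92s Health", "Medicaid")

# fixed = TRUE treats the pattern literally; useBytes = TRUE matches byte by byte,
# which avoids problems with strings that are invalid in the current encoding
cleaned <- gsub("\x92", "'", x, fixed = TRUE, useBytes = TRUE)
cleaned
```

The exact-match assignments used in this workshop are safer when only a few known values are affected; the byte-wise replacement is convenient when the same byte appears in many different strings.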

ASD_National$Source_Full1[ASD_National$Source_Full1 == "National Survey of Children\x92s Health"] <- 
"National Survey of Children's Health"
ASD_National$Source_Full2[ASD_National$Source_Full2 == "nsch-National Survey of Children\x92s Health"] <- 
"nsch-National Survey of Children's Health"
<h3>
R Fundamentals - Delete/Drop dataframe variable
</h3>

Delete/Drop duplicate variable: Prevalence_dup

drop <- c("Prevalence_dup", "Dummy Variable Name")
ASD_National = ASD_National[, !(names(ASD_National) %in% drop)] # Recall Dataframe[rows,columns]
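As a sketch of why the non-existent name in `drop` is harmless, the snippet below applies the same pattern to a made-up data frame (`toy`, `drop_demo` and the columns are illustrative). `%in%` simply returns `FALSE` for names that are absent, so no error is raised:

```r
# Hypothetical toy data frame with a duplicated column
toy <- data.frame(Source         = c("addm", "nsch"),
                  Prevalence     = c(6.7, 9.5),
                  Prevalence_dup = c(6.7, 9.5))
drop_demo <- c("Prevalence_dup", "Dummy Variable Name")

# Keep only the columns whose names are not listed in drop_demo
toy <- toy[, !(names(toy) %in% drop_demo), drop = FALSE]
names(toy)

# A single column can also be removed by assigning NULL
toy$Prevalence <- NULL
names(toy)
```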
<h3>
R Fundamentals - Create/Add dataframe variable
</h3>

Create one new variable: Source_UC by converting to uppercase letters

ASD_National$Source_UC <- toupper(ASD_National$Source) # toupper() already returns a character vector

Create one new variable: Source_Full3 by combining Source and Source_Full1

ASD_National$Source_Full3 <- paste(toupper(ASD_National$Source), ASD_National$Source_Full1)

Create one new ordinal categorical variable: Prevalence_Risk2 (“Low”, “High”) by binning Prevalence

# Recode Prevalence into risk categories

# Low [0, 5)
# High [5, +Inf) 

ASD_National$Prevalence_Risk2[ASD_National$Prevalence < 5] = "Low"
## Warning: Unknown or uninitialised column: 'Prevalence_Risk2'.
ASD_National$Prevalence_Risk2[ASD_National$Prevalence >= 5 ] = "High"
#
head(ASD_National)
## # A tibble: 6 x 28
##   Source  Year Prevalence Upper.CI Lower.CI Source_Full1 Source_Full2
##   <chr>  <int>      <dbl>    <dbl>    <dbl> <chr>        <chr>       
## 1 addm    2000        6.7      7        6.3 Autism & De… addm-Autism…
## 2 addm    2002        6.6      6.8      6.3 Autism & De… addm-Autism…
## 3 addm    2004        8        8.4      7.6 Autism & De… addm-Autism…
## 4 addm    2006        9        9.3      8.6 Autism & De… addm-Autism…
## 5 addm    2008       11.3     11.7     11   Autism & De… addm-Autism…
## 6 addm    2010       14.7     15.1     14.3 Autism & De… addm-Autism…
## # … with 21 more variables: Male.Prevalence <chr>, Male.Lower.CI <chr>,
## #   Male.Upper.CI <chr>, Female.Prevalence <chr>, Female.Lower.CI <chr>,
## #   Female.Upper.CI <chr>, Non.hispanic.white.Prevalence <chr>,
## #   Non.hispanic.white.Lower.CI <chr>, Non.hispanic.white.Upper.CI <chr>,
## #   Non.hispanic.black.Prevalence <chr>, Non.hispanic.black.Lower.CI <chr>,
## #   Non.hispanic.black.Upper.CI <chr>, Hispanic.Prevalence <chr>,
## #   Hispanic.Lower.CI <chr>, Hispanic.Upper.CI <chr>,
## #   Asian.or.Pacific.Islander.Prevalence <chr>,
## #   Asian.or.Pacific.Islander.Lower.CI <chr>,
## #   Asian.or.Pacific.Islander.Upper.CI <chr>, Source_UC <chr>,
## #   Source_Full3 <chr>, Prevalence_Risk2 <chr>

Create one new ordinal categorical variable: Prevalence_Risk4 (“Low”, “Medium”, “High”, “Very High”) by binning Prevalence

# Recode Prevalence into risk categories

# Low [0, 5)
# Medium [5, 10)
# High [10, 20)
# Very High [20, +Inf) 

ASD_National$Prevalence_Risk4 = "Very High"
ASD_National$Prevalence_Risk4[ASD_National$Prevalence < 20 ] = "High"
ASD_National$Prevalence_Risk4[ASD_National$Prevalence < 10 ] = "Medium"
ASD_National$Prevalence_Risk4[ASD_National$Prevalence < 5] = "Low"
#
head(ASD_National)
## # A tibble: 6 x 29
##   Source  Year Prevalence Upper.CI Lower.CI Source_Full1 Source_Full2
##   <chr>  <int>      <dbl>    <dbl>    <dbl> <chr>        <chr>       
## 1 addm    2000        6.7      7        6.3 Autism & De… addm-Autism…
## 2 addm    2002        6.6      6.8      6.3 Autism & De… addm-Autism…
## 3 addm    2004        8        8.4      7.6 Autism & De… addm-Autism…
## 4 addm    2006        9        9.3      8.6 Autism & De… addm-Autism…
## 5 addm    2008       11.3     11.7     11   Autism & De… addm-Autism…
## 6 addm    2010       14.7     15.1     14.3 Autism & De… addm-Autism…
## # … with 22 more variables: Male.Prevalence <chr>, Male.Lower.CI <chr>,
## #   Male.Upper.CI <chr>, Female.Prevalence <chr>, Female.Lower.CI <chr>,
## #   Female.Upper.CI <chr>, Non.hispanic.white.Prevalence <chr>,
## #   Non.hispanic.white.Lower.CI <chr>, Non.hispanic.white.Upper.CI <chr>,
## #   Non.hispanic.black.Prevalence <chr>, Non.hispanic.black.Lower.CI <chr>,
## #   Non.hispanic.black.Upper.CI <chr>, Hispanic.Prevalence <chr>,
## #   Hispanic.Lower.CI <chr>, Hispanic.Upper.CI <chr>,
## #   Asian.or.Pacific.Islander.Prevalence <chr>,
## #   Asian.or.Pacific.Islander.Lower.CI <chr>,
## #   Asian.or.Pacific.Islander.Upper.CI <chr>, Source_UC <chr>,
## #   Source_Full3 <chr>, Prevalence_Risk2 <chr>, Prevalence_Risk4 <chr>
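The step-by-step assignments above can also be sketched with `cut()`, which bins a numeric vector in one call. The vector `prev` is made up for illustration; `right = FALSE` makes each interval closed on the left, matching the `[a, b)` ranges in the comments:

```r
prev <- c(2.3, 5, 6.7, 16.8, 20, 29.2)  # hypothetical prevalence values

risk4 <- cut(prev,
             breaks = c(0, 5, 10, 20, Inf),
             labels = c("Low", "Medium", "High", "Very High"),
             right = FALSE,          # intervals are [a, b)
             ordered_result = TRUE)  # return an ordered factor
data.frame(prev, risk4)
```

Unlike the manual assignments, `cut()` returns NA for values that fall outside the breaks (here, negative values), and the result is already an ordered factor.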
<h3>
R Fundamentals - Convert to correct data types
</h3>

Review data structure and variable names:

str(ASD_National)
## Classes 'tbl_df', 'tbl' and 'data.frame':    42 obs. of  29 variables:
##  $ Source                              : chr  "addm" "addm" "addm" "addm" ...
##  $ Year                                : int  2000 2002 2004 2006 2008 2010 2012 2014 2004 2008 ...
##  $ Prevalence                          : num  6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
##  $ Upper.CI                            : num  7 6.8 8.4 9.3 11.7 15.1 15.2 17.3 12 18.1 ...
##  $ Lower.CI                            : num  6.3 6.3 7.6 8.6 11 14.3 14.4 16.4 7.4 14.5 ...
##  $ Source_Full1                        : chr  "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
##  $ Source_Full2                        : chr  "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" ...
##  $ Male.Prevalence                     : chr  NA "11.5" "12.9" "14.5" ...
##  $ Male.Lower.CI                       : chr  NA NA "12.2" "13.9" ...
##  $ Male.Upper.CI                       : chr  NA NA "13.7" "15.1" ...
##  $ Female.Prevalence                   : chr  NA "2.7" "2.9" "3.2" ...
##  $ Female.Lower.CI                     : chr  NA NA "2.6" "2.9" ...
##  $ Female.Upper.CI                     : chr  NA NA "3.3" "3.5" ...
##  $ Non.hispanic.white.Prevalence       : chr  NA "7.7" "9.7" "9.9" ...
##  $ Non.hispanic.white.Lower.CI         : chr  NA NA "9.1" "9.4" ...
##  $ Non.hispanic.white.Upper.CI         : chr  NA NA "10.4" "10.4" ...
##  $ Non.hispanic.black.Prevalence       : chr  NA "6.5" "6.9" "7.2" ...
##  $ Non.hispanic.black.Lower.CI         : chr  NA NA "6.2" "6.6" ...
##  $ Non.hispanic.black.Upper.CI         : chr  NA NA "7.6" "7.8" ...
##  $ Hispanic.Prevalence                 : chr  NA NA "6.2" "5.9" ...
##  $ Hispanic.Lower.CI                   : chr  NA NA "5" "5.3" ...
##  $ Hispanic.Upper.CI                   : chr  NA NA "7.5" "6.6" ...
##  $ Asian.or.Pacific.Islander.Prevalence: chr  NA NA NA NA ...
##  $ Asian.or.Pacific.Islander.Lower.CI  : chr  NA NA NA NA ...
##  $ Asian.or.Pacific.Islander.Upper.CI  : chr  NA NA NA NA ...
##  $ Source_UC                           : chr  "ADDM" "ADDM" "ADDM" "ADDM" ...
##  $ Source_Full3                        : chr  "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" ...
##  $ Prevalence_Risk2                    : chr  "High" "High" "High" "High" ...
##  $ Prevalence_Risk4                    : chr  "Medium" "Medium" "Medium" "Medium" ...
cbind(names(ASD_National), c(1:length(names(ASD_National))))
##       [,1]                                   [,2]
##  [1,] "Source"                               "1" 
##  [2,] "Year"                                 "2" 
##  [3,] "Prevalence"                           "3" 
##  [4,] "Upper.CI"                             "4" 
##  [5,] "Lower.CI"                             "5" 
##  [6,] "Source_Full1"                         "6" 
##  [7,] "Source_Full2"                         "7" 
##  [8,] "Male.Prevalence"                      "8" 
##  [9,] "Male.Lower.CI"                        "9" 
## [10,] "Male.Upper.CI"                        "10"
## [11,] "Female.Prevalence"                    "11"
## [12,] "Female.Lower.CI"                      "12"
## [13,] "Female.Upper.CI"                      "13"
## [14,] "Non.hispanic.white.Prevalence"        "14"
## [15,] "Non.hispanic.white.Lower.CI"          "15"
## [16,] "Non.hispanic.white.Upper.CI"          "16"
## [17,] "Non.hispanic.black.Prevalence"        "17"
## [18,] "Non.hispanic.black.Lower.CI"          "18"
## [19,] "Non.hispanic.black.Upper.CI"          "19"
## [20,] "Hispanic.Prevalence"                  "20"
## [21,] "Hispanic.Lower.CI"                    "21"
## [22,] "Hispanic.Upper.CI"                    "22"
## [23,] "Asian.or.Pacific.Islander.Prevalence" "23"
## [24,] "Asian.or.Pacific.Islander.Lower.CI"   "24"
## [25,] "Asian.or.Pacific.Islander.Upper.CI"   "25"
## [26,] "Source_UC"                            "26"
## [27,] "Source_Full3"                         "27"
## [28,] "Prevalence_Risk2"                     "28"
## [29,] "Prevalence_Risk4"                     "29"

Convert the Prevalence and CI variables in columns 8 to 25 from character (chr) to numeric:

ix <- 8:25 # define an index
# apply()
ASD_National[ix] <- apply(ASD_National[ix], 2, as.numeric) # "2" means column-wise; "1" means row-wise.
# Uncomment below to show help
# ?apply # Documentation
# or lapply()
ASD_National[ix] <- lapply(ASD_National[ix], as.numeric) # column-wise
# Uncomment below to show help
# ?lapply # Documentation
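The two calls above give the same result here, but they work differently: `apply()` first coerces the columns to a matrix, while `lapply()` processes each column as a separate vector. A sketch on a made-up data frame (`toy` is illustrative):

```r
# Hypothetical toy data frame with numbers stored as character strings
toy <- data.frame(Male.Prevalence   = c("11.5", "12.9", NA),
                  Female.Prevalence = c("2.7", "2.9", NA),
                  stringsAsFactors = FALSE)

m <- apply(toy, 2, as.numeric)    # returns a numeric matrix
l <- lapply(toy, as.numeric)      # returns a list of numeric vectors
class(m)
class(l)

toy[] <- lapply(toy, as.numeric)  # toy[] keeps the data.frame structure
sapply(toy, class)
```

`lapply()` is usually the safer choice for data frames with mixed column types, since it never forces all columns into one common type.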

Convert Source and the other source-label variables (columns 1, 6, 7, 26, 27) from character (chr) to factor:

ix <- c(1, 6, 7, 26, 27) # define an index
ASD_National[ix] <- lapply(ASD_National[ix], as.factor)

Create new ordered factor Year_Factor from Year

ASD_National$Year_Factor <- factor(ASD_National$Year, ordered = TRUE)
# Observe the difference of 'Levels' in below two factors
ASD_National$Year_Factor # Ordinal categorical variable
##  [1] 2000 2002 2004 2006 2008 2010 2012 2014 2004 2008 2012 2016 2000 2001 2002
## [16] 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2000
## [31] 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 17 Levels: 2000 < 2001 < 2002 < 2003 < 2004 < 2005 < 2006 < 2007 < ... < 2016
str(ASD_National$Year_Factor)
##  Ord.factor w/ 17 levels "2000"<"2001"<..: 1 3 5 7 9 11 13 15 5 9 ...
ASD_National$Source # Nominal categorical variable
##  [1] addm addm addm addm addm addm addm addm nsch nsch nsch nsch sped sped sped
## [16] sped sped sped sped sped sped sped sped sped sped sped sped sped sped medi
## [31] medi medi medi medi medi medi medi medi medi medi medi medi
## Levels: addm medi nsch sped
str(ASD_National$Source)
##  Factor w/ 4 levels "addm","medi",..: 1 1 1 1 1 1 1 1 3 3 ...

Convert Prevalence_Risk2 & Prevalence_Risk4 to ordered factors

# Convert to factor
ASD_National$Prevalence_Risk2 = factor(ASD_National$Prevalence_Risk2, ordered=TRUE,
                                           levels=c("Low", "High"))
# Convert to factor
ASD_National$Prevalence_Risk4 = factor(ASD_National$Prevalence_Risk4, ordered=TRUE,
                                           levels=c("Low", "Medium", "High", "Very High"))
# Optionally, the conversions can be done manually, one variable at a time:
# ASD_National$Male.Prevalence = as.numeric(ASD_National$Male.Prevalence)
# ASD_National$Source = as.factor(ASD_National$Source)
# ASD_National$Prevalence_Risk2 = factor(ASD_National$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
# ASD_National$Prevalence_Risk4 = factor(ASD_National$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
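A short sketch of why the `levels` argument matters (the vector `risk` is made up): without it, levels are sorted alphabetically, which would place "High" below "Low".

```r
risk <- c("Low", "High", "High", "Low")  # hypothetical values

levels(factor(risk, ordered = TRUE))     # alphabetical order: "High" then "Low"

risk_ord <- factor(risk, ordered = TRUE, levels = c("Low", "High"))
levels(risk_ord)                         # intended order: "Low" then "High"

# Ordered factors support comparison operators
risk_ord >= "High"
table(risk_ord)
```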

Optionally, export the processed dataframe to a CSV file.

write.csv(ASD_National, file = "../dataset/ADV_ASD_National_R.csv", row.names = FALSE)
# Read back in above saved file:
# ASD_National <- read.csv("../dataset/ADV_ASD_National_R.csv")
# ASD_National$Year_Factor <- factor(ASD_National$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
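A CSV file stores only text, so factor ordering is lost on export, which is why the commented lines above re-create `Year_Factor` after reading the file back. As a sketch (using a made-up data frame `toy` and temporary files), `saveRDS()` and `readRDS()` preserve R data types when the file only needs to be read back by R:

```r
# Hypothetical toy data frame with an ordered factor
toy <- data.frame(Year = c(2000L, 2002L, 2004L))
toy$Year_Factor <- factor(toy$Year, ordered = TRUE)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

# Round trip through CSV: the ordering information is not stored
write.csv(toy, file = csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)
is.ordered(from_csv$Year_Factor)

# Round trip through RDS: the ordered factor is preserved
saveRDS(toy, file = rds_file)
from_rds <- readRDS(rds_file)
is.ordered(from_rds$Year_Factor)
```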

Data Summarization

<h3>
Data Summarization - High Level Data Summary
</h3>
summary(ASD_National)
##   Source        Year        Prevalence        Upper.CI         Lower.CI     
##  addm: 8   Min.   :2000   Min.   : 1.800   Min.   : 1.800   Min.   : 1.700  
##  medi:13   1st Qu.:2004   1st Qu.: 3.950   1st Qu.: 3.950   1st Qu.: 3.875  
##  nsch: 4   Median :2008   Median : 6.650   Median : 6.900   Median : 6.350  
##  sped:17   Mean   :2007   Mean   : 7.952   Mean   : 8.207   Mean   : 7.712  
##            3rd Qu.:2011   3rd Qu.: 9.725   3rd Qu.:10.350   3rd Qu.: 9.625  
##            Max.   :2016   Max.   :29.200   Max.   :30.700   Max.   :27.700  
##                                                                             
##                                                  Source_Full1
##  Autism & Developmental Disabilities Monitoring Network: 8   
##  Medicaid                                              :13   
##  National Survey of Children's Health                  : 4   
##  Special Education Child Count                         :17   
##                                                              
##                                                              
##                                                              
##                                                       Source_Full2
##  addm-Autism & Developmental Disabilities Monitoring Network: 8   
##  medi-Medicaid                                              :13   
##  nsch-National Survey of Children's Health                  : 4   
##  sped-Special Education Child Count                         :17   
##                                                                   
##                                                                   
##                                                                   
##  Male.Prevalence Male.Lower.CI   Male.Upper.CI   Female.Prevalence
##  Min.   :11.50   Min.   :12.20   Min.   :13.70   Min.   :2.700    
##  1st Qu.:13.70   1st Qu.:14.85   1st Qu.:16.07   1st Qu.:3.050    
##  Median :18.40   Median :20.20   Median :21.55   Median :4.000    
##  Mean   :18.71   Mean   :19.22   Mean   :20.62   Mean   :4.271    
##  3rd Qu.:23.55   3rd Qu.:22.93   3rd Qu.:24.32   3rd Qu.:5.250    
##  Max.   :26.60   Max.   :25.80   Max.   :27.40   Max.   :6.600    
##  NA's   :35      NA's   :36      NA's   :36      NA's   :35       
##  Female.Lower.CI Female.Upper.CI Non.hispanic.white.Prevalence
##  Min.   :2.600   Min.   :3.300   Min.   : 7.70                
##  1st Qu.:3.100   1st Qu.:3.700   1st Qu.: 9.80                
##  Median :4.300   Median :4.950   Median :12.00                
##  Mean   :4.217   Mean   :4.900   Mean   :12.51                
##  3rd Qu.:4.975   3rd Qu.:5.675   3rd Qu.:15.55                
##  Max.   :6.200   Max.   :7.000   Max.   :17.20                
##  NA's   :36      NA's   :36      NA's   :35                   
##  Non.hispanic.white.Lower.CI Non.hispanic.white.Upper.CI
##  Min.   : 9.100              Min.   :10.40              
##  1st Qu.: 9.925              1st Qu.:10.93              
##  Median :13.100              Median :14.20              
##  Mean   :12.733              Mean   :13.88              
##  3rd Qu.:15.075              3rd Qu.:16.20              
##  Max.   :16.500              Max.   :17.80              
##  NA's   :36                  NA's   :36                 
##  Non.hispanic.black.Prevalence Non.hispanic.black.Lower.CI
##  Min.   : 6.50                 Min.   : 6.200             
##  1st Qu.: 7.05                 1st Qu.: 7.325             
##  Median :10.20                 Median :10.500             
##  Mean   :10.31                 Mean   :10.200             
##  3rd Qu.:12.70                 3rd Qu.:12.100             
##  Max.   :16.00                 Max.   :15.100             
##  NA's   :35                    NA's   :36                 
##  Non.hispanic.black.Upper.CI Hispanic.Prevalence Hispanic.Lower.CI
##  Min.   : 7.600              Min.   : 5.900      Min.   : 5.000   
##  1st Qu.: 8.575              1st Qu.: 6.625      1st Qu.: 5.775   
##  Median :12.000              Median : 9.000      Median : 8.300   
##  Mean   :11.700              Mean   : 9.150      Mean   : 8.333   
##  3rd Qu.:13.700              3rd Qu.:10.625      3rd Qu.: 9.850   
##  Max.   :16.900              Max.   :14.000      Max.   :13.100   
##  NA's   :36                  NA's   :36          NA's   :36       
##  Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
##  Min.   : 6.600    Min.   : 9.70                       
##  1st Qu.: 7.775    1st Qu.:10.97                       
##  Median : 9.750    Median :11.85                       
##  Mean   :10.017    Mean   :11.72                       
##  3rd Qu.:11.425    3rd Qu.:12.60                       
##  Max.   :14.900    Max.   :13.50                       
##  NA's   :36        NA's   :38                          
##  Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
##  Min.   : 8.10                      Min.   :11.60                     
##  1st Qu.: 9.45                      1st Qu.:12.72                     
##  Median :10.30                      Median :13.65                     
##  Mean   :10.12                      Mean   :13.57                     
##  3rd Qu.:10.97                      3rd Qu.:14.50                     
##  Max.   :11.80                      Max.   :15.40                     
##  NA's   :38                         NA's   :38                        
##  Source_UC                                                      Source_Full3
##  ADDM: 8   ADDM Autism & Developmental Disabilities Monitoring Network: 8   
##  MEDI:13   MEDI Medicaid                                              :13   
##  NSCH: 4   NSCH National Survey of Children's Health                  : 4   
##  SPED:17   SPED Special Education Child Count                         :17   
##                                                                             
##                                                                             
##                                                                             
##  Prevalence_Risk2  Prevalence_Risk4  Year_Factor
##  Low :14          Low      :14      2004   : 4  
##  High:28          Medium   :18      2008   : 4  
##                   High     : 8      2012   : 4  
##                   Very High: 2      2000   : 3  
##                                     2002   : 3  
##                                     2006   : 3  
##                                     (Other):21
<h3>
Data Summarization - Summary of <span style="color:blue">numeric</span> variables
</h3>
# Filter only numeric variables/columns
select_if(ASD_National, is.numeric) # library(dplyr)
## # A tibble: 42 x 22
##     Year Prevalence Upper.CI Lower.CI Male.Prevalence Male.Lower.CI
##    <int>      <dbl>    <dbl>    <dbl>           <dbl>         <dbl>
##  1  2000        6.7      7        6.3            NA            NA  
##  2  2002        6.6      6.8      6.3            11.5          NA  
##  3  2004        8        8.4      7.6            12.9          12.2
##  4  2006        9        9.3      8.6            14.5          13.9
##  5  2008       11.3     11.7     11              18.4          17.7
##  6  2010       14.7     15.1     14.3            23.7          23  
##  7  2012       14.8     15.2     14.4            23.4          22.7
##  8  2014       16.8     17.3     16.4            26.6          25.8
##  9  2004        9.5     12        7.4            NA            NA  
## 10  2008       16.2     18.1     14.5            NA            NA  
## # … with 32 more rows, and 16 more variables: Male.Upper.CI <dbl>,
## #   Female.Prevalence <dbl>, Female.Lower.CI <dbl>, Female.Upper.CI <dbl>,
## #   Non.hispanic.white.Prevalence <dbl>, Non.hispanic.white.Lower.CI <dbl>,
## #   Non.hispanic.white.Upper.CI <dbl>, Non.hispanic.black.Prevalence <dbl>,
## #   Non.hispanic.black.Lower.CI <dbl>, Non.hispanic.black.Upper.CI <dbl>,
## #   Hispanic.Prevalence <dbl>, Hispanic.Lower.CI <dbl>,
## #   Hispanic.Upper.CI <dbl>, Asian.or.Pacific.Islander.Prevalence <dbl>,
## #   Asian.or.Pacific.Islander.Lower.CI <dbl>,
## #   Asian.or.Pacific.Islander.Upper.CI <dbl>
# Data summarization
summary(select_if(ASD_National, is.numeric))
##       Year        Prevalence        Upper.CI         Lower.CI     
##  Min.   :2000   Min.   : 1.800   Min.   : 1.800   Min.   : 1.700  
##  1st Qu.:2004   1st Qu.: 3.950   1st Qu.: 3.950   1st Qu.: 3.875  
##  Median :2008   Median : 6.650   Median : 6.900   Median : 6.350  
##  Mean   :2007   Mean   : 7.952   Mean   : 8.207   Mean   : 7.712  
##  3rd Qu.:2011   3rd Qu.: 9.725   3rd Qu.:10.350   3rd Qu.: 9.625  
##  Max.   :2016   Max.   :29.200   Max.   :30.700   Max.   :27.700  
##                                                                   
##  Male.Prevalence Male.Lower.CI   Male.Upper.CI   Female.Prevalence
##  Min.   :11.50   Min.   :12.20   Min.   :13.70   Min.   :2.700    
##  1st Qu.:13.70   1st Qu.:14.85   1st Qu.:16.07   1st Qu.:3.050    
##  Median :18.40   Median :20.20   Median :21.55   Median :4.000    
##  Mean   :18.71   Mean   :19.22   Mean   :20.62   Mean   :4.271    
##  3rd Qu.:23.55   3rd Qu.:22.93   3rd Qu.:24.32   3rd Qu.:5.250    
##  Max.   :26.60   Max.   :25.80   Max.   :27.40   Max.   :6.600    
##  NA's   :35      NA's   :36      NA's   :36      NA's   :35       
##  Female.Lower.CI Female.Upper.CI Non.hispanic.white.Prevalence
##  Min.   :2.600   Min.   :3.300   Min.   : 7.70                
##  1st Qu.:3.100   1st Qu.:3.700   1st Qu.: 9.80                
##  Median :4.300   Median :4.950   Median :12.00                
##  Mean   :4.217   Mean   :4.900   Mean   :12.51                
##  3rd Qu.:4.975   3rd Qu.:5.675   3rd Qu.:15.55                
##  Max.   :6.200   Max.   :7.000   Max.   :17.20                
##  NA's   :36      NA's   :36      NA's   :35                   
##  Non.hispanic.white.Lower.CI Non.hispanic.white.Upper.CI
##  Min.   : 9.100              Min.   :10.40              
##  1st Qu.: 9.925              1st Qu.:10.93              
##  Median :13.100              Median :14.20              
##  Mean   :12.733              Mean   :13.88              
##  3rd Qu.:15.075              3rd Qu.:16.20              
##  Max.   :16.500              Max.   :17.80              
##  NA's   :36                  NA's   :36                 
##  Non.hispanic.black.Prevalence Non.hispanic.black.Lower.CI
##  Min.   : 6.50                 Min.   : 6.200             
##  1st Qu.: 7.05                 1st Qu.: 7.325             
##  Median :10.20                 Median :10.500             
##  Mean   :10.31                 Mean   :10.200             
##  3rd Qu.:12.70                 3rd Qu.:12.100             
##  Max.   :16.00                 Max.   :15.100             
##  NA's   :35                    NA's   :36                 
##  Non.hispanic.black.Upper.CI Hispanic.Prevalence Hispanic.Lower.CI
##  Min.   : 7.600              Min.   : 5.900      Min.   : 5.000   
##  1st Qu.: 8.575              1st Qu.: 6.625      1st Qu.: 5.775   
##  Median :12.000              Median : 9.000      Median : 8.300   
##  Mean   :11.700              Mean   : 9.150      Mean   : 8.333   
##  3rd Qu.:13.700              3rd Qu.:10.625      3rd Qu.: 9.850   
##  Max.   :16.900              Max.   :14.000      Max.   :13.100   
##  NA's   :36                  NA's   :36          NA's   :36       
##  Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
##  Min.   : 6.600    Min.   : 9.70                       
##  1st Qu.: 7.775    1st Qu.:10.97                       
##  Median : 9.750    Median :11.85                       
##  Mean   :10.017    Mean   :11.72                       
##  3rd Qu.:11.425    3rd Qu.:12.60                       
##  Max.   :14.900    Max.   :13.50                       
##  NA's   :36        NA's   :38                          
##  Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
##  Min.   : 8.10                      Min.   :11.60                     
##  1st Qu.: 9.45                      1st Qu.:12.72                     
##  Median :10.30                      Median :13.65                     
##  Mean   :10.12                      Mean   :13.57                     
##  3rd Qu.:10.97                      3rd Qu.:14.50                     
##  Max.   :11.80                      Max.   :15.40                     
##  NA's   :38                         NA's   :38

[ Tips ] We notice missing data in a few Prevalence variables.

# Calculate average Prevalence (this variable has no missing values)
mean(ASD_National$Prevalence)
## [1] 7.952381
mean(ASD_National$Prevalence[ASD_National$Source == 'addm'])
## [1] 10.9875
mean(ASD_National$Prevalence[ASD_National$Source == 'medi'])
## [1] 4.676923
mean(ASD_National$Prevalence[ASD_National$Source == 'nsch'])
## [1] 19.025
mean(ASD_National$Prevalence[ASD_National$Source == 'sped'])
## [1] 6.423529
# Calculate average Male.Prevalence: the result is NA (not an error)
mean(ASD_National$Male.Prevalence)
## [1] NA
# mean() returns NA when any value is missing, so we use na.rm = TRUE to drop NAs first
mean(ASD_National$Male.Prevalence, na.rm = TRUE)
## [1] 18.71429
mean(ASD_National$Female.Prevalence, na.rm = TRUE)
## [1] 4.271429
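
The per-source calls to mean() above can be collapsed into one grouped call. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the CDC data); it uses `tapply()` with `na.rm = TRUE` and `colSums(is.na())` to count missing values per column.

```r
# Hypothetical toy data: values are illustrative only, not taken from the CDC dataset
toy <- data.frame(
  Source          = c("addm", "addm", "medi", "medi", "sped"),
  Prevalence      = c(6, 12, 3, 5, 4),
  Male.Prevalence = c(10, 20, NA, NA, NA)
)

# Mean of Prevalence for every Source in one call
group_means <- tapply(toy$Prevalence, toy$Source, mean)
print(group_means)

# Extra arguments such as na.rm are passed on to mean()
male_means <- tapply(toy$Male.Prevalence, toy$Source, mean, na.rm = TRUE)
print(male_means)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Note that a group whose values are all NA yields NaN rather than a number, which is a useful hint that the source does not report that variable at all.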
# Count occurrences of unique values in a variable/column: number of rows (data entries) per year
table(ASD_National$Year) # ?table
## 
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 
##    3    2    3    2    4    2    3    2    4    2    3    2    4    1    2    1 
## 2016 
##    2
<h3>
Data Summarization - Summary of <span style="color:blue">categorical</span> variables
</h3>
# List of categorical variables
names(select_if(ASD_National, is.factor)) # All categorical variables are factor data type
## [1] "Source"           "Source_Full1"     "Source_Full2"     "Source_UC"       
## [5] "Source_Full3"     "Prevalence_Risk2" "Prevalence_Risk4" "Year_Factor"
names(select_if(ASD_National, is.character)) # No categorical variable is character data type
## character(0)
# Look at summary
summary(select_if(ASD_National, is.factor))
##   Source                                                   Source_Full1
##  addm: 8   Autism & Developmental Disabilities Monitoring Network: 8   
##  medi:13   Medicaid                                              :13   
##  nsch: 4   National Survey of Children's Health                  : 4   
##  sped:17   Special Education Child Count                         :17   
##                                                                        
##                                                                        
##                                                                        
##                                                       Source_Full2 Source_UC
##  addm-Autism & Developmental Disabilities Monitoring Network: 8    ADDM: 8  
##  medi-Medicaid                                              :13    MEDI:13  
##  nsch-National Survey of Children's Health                  : 4    NSCH: 4  
##  sped-Special Education Child Count                         :17    SPED:17  
##                                                                             
##                                                                             
##                                                                             
##                                                       Source_Full3
##  ADDM Autism & Developmental Disabilities Monitoring Network: 8   
##  MEDI Medicaid                                              :13   
##  NSCH National Survey of Children's Health                  : 4   
##  SPED Special Education Child Count                         :17   
##                                                                   
##                                                                   
##                                                                   
##  Prevalence_Risk2  Prevalence_Risk4  Year_Factor
##  Low :14          Low      :14      2004   : 4  
##  High:28          Medium   :18      2008   : 4  
##                   High     : 8      2012   : 4  
##                   Very High: 2      2000   : 3  
##                                     2002   : 3  
##                                     2006   : 3  
##                                     (Other):21
summary(select_if(ASD_National, is.character))
## < table of extent 0 x 0 >
# Count occurrences of unique values in a variable/column
table(ASD_National$Source)
## 
## addm medi nsch sped 
##    8   13    4   17
table(ASD_National$Source_Full3)
## 
## ADDM Autism & Developmental Disabilities Monitoring Network 
##                                                           8 
##                                               MEDI Medicaid 
##                                                          13 
##                   NSCH National Survey of Children's Health 
##                                                           4 
##                          SPED Special Education Child Count 
##                                                          17
table(ASD_National$Year_Factor)
## 
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 
##    3    2    3    2    4    2    3    2    4    2    3    2    4    1    2    1 
## 2016 
##    2
table(ASD_National$Prevalence) # numeric is also possible
## 
##  1.8  2.1  2.3  2.6  2.8    3  3.5  3.6  3.9  4.1  4.4  4.8  5.1  5.4  5.6  5.9 
##    1    1    1    2    1    2    1    1    1    1    1    1    1    1    1    1 
##  6.2  6.4  6.6  6.7    7  7.1  7.7    8  8.2  8.4    9  9.1  9.5  9.8 10.5 11.2 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 11.3 11.9 14.7 14.8 16.2 16.8 21.2 29.2 
##    1    1    1    1    1    1    1    1
# Display unique values (levels) of each factor (categorical) variable
lapply(select_if(ASD_National, is.factor), levels)
## $Source
## [1] "addm" "medi" "nsch" "sped"
## 
## $Source_Full1
## [1] "Autism & Developmental Disabilities Monitoring Network"
## [2] "Medicaid"                                              
## [3] "National Survey of Children's Health"                  
## [4] "Special Education Child Count"                         
## 
## $Source_Full2
## [1] "addm-Autism & Developmental Disabilities Monitoring Network"
## [2] "medi-Medicaid"                                              
## [3] "nsch-National Survey of Children's Health"                  
## [4] "sped-Special Education Child Count"                         
## 
## $Source_UC
## [1] "ADDM" "MEDI" "NSCH" "SPED"
## 
## $Source_Full3
## [1] "ADDM Autism & Developmental Disabilities Monitoring Network"
## [2] "MEDI Medicaid"                                              
## [3] "NSCH National Survey of Children's Health"                  
## [4] "SPED Special Education Child Count"                         
## 
## $Prevalence_Risk2
## [1] "Low"  "High"
## 
## $Prevalence_Risk4
## [1] "Low"       "Medium"    "High"      "Very High"
## 
## $Year_Factor
##  [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016"
# or using variable names
lapply(ASD_National[c('Source_UC', 'Year_Factor')], levels)
## $Source_UC
## [1] "ADDM" "MEDI" "NSCH" "SPED"
## 
## $Year_Factor
##  [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016"
# Cross-tabulation (pivot) of record counts: Source by Year
table(ASD_National$Source_Full3, ASD_National$Year) # table(ASD_National$Year, ASD_National$Source_Full3)
##                                                              
##                                                               2000 2001 2002
##   ADDM Autism & Developmental Disabilities Monitoring Network    1    0    1
##   MEDI Medicaid                                                  1    1    1
##   NSCH National Survey of Children's Health                      0    0    0
##   SPED Special Education Child Count                             1    1    1
##                                                              
##                                                               2003 2004 2005
##   ADDM Autism & Developmental Disabilities Monitoring Network    0    1    0
##   MEDI Medicaid                                                  1    1    1
##   NSCH National Survey of Children's Health                      0    1    0
##   SPED Special Education Child Count                             1    1    1
##                                                              
##                                                               2006 2007 2008
##   ADDM Autism & Developmental Disabilities Monitoring Network    1    0    1
##   MEDI Medicaid                                                  1    1    1
##   NSCH National Survey of Children's Health                      0    0    1
##   SPED Special Education Child Count                             1    1    1
##                                                              
##                                                               2009 2010 2011
##   ADDM Autism & Developmental Disabilities Monitoring Network    0    1    0
##   MEDI Medicaid                                                  1    1    1
##   NSCH National Survey of Children's Health                      0    0    0
##   SPED Special Education Child Count                             1    1    1
##                                                              
##                                                               2012 2013 2014
##   ADDM Autism & Developmental Disabilities Monitoring Network    1    0    1
##   MEDI Medicaid                                                  1    0    0
##   NSCH National Survey of Children's Health                      1    0    0
##   SPED Special Education Child Count                             1    1    1
##                                                              
##                                                               2015 2016
##   ADDM Autism & Developmental Disabilities Monitoring Network    0    0
##   MEDI Medicaid                                                  0    0
##   NSCH National Survey of Children's Health                      0    1
##   SPED Special Education Child Count                             1    1
# Cross-tabulation (pivot) of record counts: two-level Risk by Source
table(ASD_National$Prevalence_Risk2, ASD_National$Source)
##       
##        addm medi nsch sped
##   Low     0    7    0    7
##   High    8    6    4   10
# Cross-tabulation (pivot) of record counts: four-level Risk by Source
table(ASD_National$Prevalence_Risk4, ASD_National$Source)
##            
##             addm medi nsch sped
##   Low          0    7    0    7
##   Medium       4    6    1    7
##   High         4    0    1    3
##   Very High    0    0    2    0
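
Raw counts are easier to compare across sources once totals and proportions are added. The sketch below re-types the Prevalence_Risk2 by Source counts printed above into a table (the object names `counts`, `with_totals` and `col_share` are chosen here for illustration) and applies `addmargins()` and `prop.table()`.

```r
# Counts copied from the Prevalence_Risk2 x Source table printed above
# (matrix() fills column by column: addm, medi, nsch, sped)
counts <- as.table(matrix(c(0, 8, 7, 6, 0, 4, 7, 10), nrow = 2,
                          dimnames = list(Risk   = c("Low", "High"),
                                          Source = c("addm", "medi", "nsch", "sped"))))

# Add row and column totals
with_totals <- addmargins(counts)
print(with_totals)

# Share of Low/High within each Source (margin = 2: each column sums to 1)
col_share <- prop.table(counts, margin = 2)
print(round(col_share, 2))
```

Using `margin = 1` instead would give the share of each source within a risk level.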

Data Visualisation (Base Graphic)

# library(repr)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
<h3>
Data Visualisation (Base Graphic) - Histogram (distribution of binned continuous variable)
</h3>

https://www.statmethods.net/graphs/density.html

hist(ASD_National$Prevalence)

par(mfrow=c(1, 2)) # Multiple plots on one page: 1 row, 2 columns
hist(ASD_National$Male.Prevalence)
hist(ASD_National$Female.Prevalence)

par(mfrow=c(1, 1)) # Reset to one plot on one page
# Histogram with annotations
hist(ASD_National$Prevalence,
     main = "Frequency of National ASD Prevalence", # Chart title
     xlab = "Prevalence per 1,000 Children", # x axis label
     ylab = "Frequency or Occurrences",# y axis label
     sub  = "Year 2000 - 2016", # Chart subtitle at bottom
     col.main="blue", col.lab="black", col.sub="darkgrey") # Colours
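
By default hist() chooses the bins itself. The sketch below sets the bin edges explicitly with `breaks` and uses `plot = FALSE` to inspect the counts without drawing. The vector `prev` is re-typed from the table(ASD_National$Prevalence) output earlier in this notebook; the 5-unit bin width is an assumption made for illustration.

```r
# The 42 Prevalence values, re-typed from the table() output shown earlier
prev <- c(1.8, 2.1, 2.3, 2.6, 2.6, 2.8, 3, 3, 3.5, 3.6, 3.9, 4.1, 4.4, 4.8,
          5.1, 5.4, 5.6, 5.9, 6.2, 6.4, 6.6, 6.7, 7, 7.1, 7.7, 8, 8.2, 8.4,
          9, 9.1, 9.5, 9.8, 10.5, 11.2, 11.3, 11.9, 14.7, 14.8, 16.2, 16.8,
          21.2, 29.2)

# Explicit bin edges: (0,5], (5,10], ..., (25,30]
h <- hist(prev, breaks = seq(0, 30, by = 5), plot = FALSE)
print(h$breaks)   # bin edges
print(h$counts)   # number of records per bin

# Draw the same histogram with the chosen bins
hist(prev, breaks = seq(0, 30, by = 5),
     main = "Frequency of National ASD Prevalence",
     xlab = "Prevalence per 1,000 Children")
```

Changing `by` changes the bin width and therefore the shape of the histogram, so it is worth trying a few values before drawing conclusions.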

<h3>
Density plot (distribution for continuous variable normalized to 100% area under curve)
</h3>

https://www.statmethods.net/graphs/density.html

par(mfrow=c(1, 2)) # Multiple plots on one page: 1 row, 2 columns

plot(density(ASD_National$Prevalence))

# Density plot with annotations
plot(density(ASD_National$Prevalence),
     main = "Density of National ASD Prevalence",
     xlab = "Prevalence per 1,000 Children",
     ylab = "Density",
     sub  = "Year 2000 - 2016",
     col.main="blue", col.lab="black", col.sub="darkgrey")

par(mfrow=c(1, 1))
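
To make the "100% area under curve" statement concrete, the sketch below integrates the density estimate numerically with the trapezoidal rule. The vector `prev` is re-typed from the table(ASD_National$Prevalence) output earlier in this notebook; the variable name `area` is chosen here for illustration.

```r
# The 42 Prevalence values, re-typed from the table() output shown earlier
prev <- c(1.8, 2.1, 2.3, 2.6, 2.6, 2.8, 3, 3, 3.5, 3.6, 3.9, 4.1, 4.4, 4.8,
          5.1, 5.4, 5.6, 5.9, 6.2, 6.4, 6.6, 6.7, 7, 7.1, 7.7, 8, 8.2, 8.4,
          9, 9.1, 9.5, 9.8, 10.5, 11.2, 11.3, 11.9, 14.7, 14.8, 16.2, 16.8,
          21.2, 29.2)

d <- density(prev)   # Gaussian kernel with the default bandwidth

# Trapezoidal rule over the grid of evaluation points in d$x
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(round(area, 3))   # close to 1
```

Because the y-axis is a density and not a count, individual y values only have meaning through the area they enclose.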
<h3>
Box plot (median, 25% quantile, 75% quantile)
</h3>

https://www.statmethods.net/graphs/boxplot.html

https://stats.stackexchange.com/questions/156778/percentile-vs-quantile-vs-quartile

0th quartile = 0 quantile = 0th percentile (minimum)

1st quartile = 0.25 quantile = 25th percentile

2nd quartile = 0.5 quantile = 50th percentile (median)

3rd quartile = 0.75 quantile = 75th percentile

4th quartile = 1 quantile = 100th percentile (maximum)
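
The equivalences above can be checked directly with `quantile()`. This is a small sketch on a made-up vector (`x` is hypothetical, chosen so the results are easy to verify by hand).

```r
# Hypothetical sample, already sorted for readability
x <- c(2, 4, 5, 7, 9, 12, 15)

# Quartiles expressed as quantiles (probabilities between 0 and 1)
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(q)

# The 0.5 quantile is the median; the 0 and 1 quantiles are min and max
print(median(x))
print(range(x))

# The interquartile range is the distance between the 0.75 and 0.25 quantiles
print(IQR(x))
```

The names of the result ("0%", "25%", ...) are the percentile labels, which ties the three vocabularies together.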

par(mfrow=c(1, 2)) # Multiple plots on one page: 1 row, 2 columns

# All-children prevalence, with and without notches, side by side.
# notch = TRUE draws a notch on each side of the box, roughly a 95% confidence interval around the median.
# If the notches of two boxes do not overlap, this is 'strong evidence' that the two medians differ.
boxplot(ASD_National$Prevalence, notch = TRUE) # With notches
boxplot(ASD_National$Prevalence) # Without notches

par(mfrow=c(1, 1))
par(mfrow=c(1, 2)) # Multiple plots on one page: 1 row, 2 columns

# Male prevalence and Female prevalence side by side:
boxplot(ASD_National$Male.Prevalence, ylim = c(0, 35), notch = TRUE) # Male children
## Warning in bxp(list(stats = structure(c(11.5, 13.7, 18.4, 23.55, 26.6), .Dim =
## c(5L, : some notches went outside hinges ('box'): maybe set notch=FALSE
boxplot(ASD_National$Female.Prevalence, ylim = c(0, 35), notch = TRUE) # Female children
## Warning in bxp(list(stats = structure(c(2.7, 3.05, 4, 5.25, 6.6), .Dim = c(5L, :
## some notches went outside hinges ('box'): maybe set notch=FALSE

par(mfrow=c(1, 1))
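
The "notches went outside hinges" warning appears when the notch is wider than the box, which is common with few observations. The sketch below reproduces the notch limits by hand using the formula given in ?boxplot.stats (median +/- 1.58 * IQR / sqrt(n)). The vector `x` is a hypothetical 7-value sample, not the Male.Prevalence data.

```r
# Hypothetical small sample (7 values, illustrative only)
x <- c(2, 4, 5, 7, 9, 12, 15)

s <- boxplot.stats(x)
hinges <- s$stats[c(2, 4)]   # lower and upper hinge (the box edges)
print(hinges)

# Notch limits computed by hand: median +/- 1.58 * (hinge spread) / sqrt(n)
notch_manual <- median(x) + c(-1.58, 1.58) * diff(hinges) / sqrt(length(x))
print(notch_manual)
print(s$conf)                # the same limits as computed by R

# TRUE when a notch limit lies beyond the box, the situation behind the warning
notch_outside <- s$conf[1] < hinges[1] || s$conf[2] > hinges[2]
print(notch_outside)
```

With more observations the sqrt(n) term shrinks the notch, so the warning tends to disappear for larger samples.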
# Display value ranges
# numeric:
range(ASD_National$Prevalence)
## [1]  1.8 29.2
range(ASD_National$Year)
## [1] 2000 2016
# categorical:
min(ASD_National$Year_Factor)
## [1] 2000
## 17 Levels: 2000 < 2001 < 2002 < 2003 < 2004 < 2005 < 2006 < 2007 < ... < 2016
max(ASD_National$Year_Factor)
## [1] 2016
## 17 Levels: 2000 < 2001 < 2002 < 2003 < 2004 < 2005 < 2006 < 2007 < ... < 2016
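
min() and max() work on Year_Factor only because it is an ordered factor (note the `<` between levels in the output above). The sketch below contrasts ordered and unordered factors on a made-up vector; `years`, `unordered` and `ordered_f` are names chosen here for illustration.

```r
years <- c("2002", "2000", "2016", "2008")

unordered <- factor(years)
ordered_f <- factor(years, ordered = TRUE)   # levels: 2000 < 2002 < 2008 < 2016

# min() on an unordered factor raises an error; capture it instead of stopping
res <- tryCatch(min(unordered), error = function(e) "error")
print(res)

# On an ordered factor, min() and max() follow the level order
print(min(ordered_f))
print(max(ordered_f))

# To get numbers back, convert via character first
# (as.numeric() on a factor alone would return the internal level codes)
year_num <- as.numeric(as.character(ordered_f))
print(range(year_num))
```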
# Create 'Prevalence' box plots broken down by 'Source'
boxplot(ASD_National$Prevalence ~ ASD_National$Source,
        main = "National ASD Prevalence by Data Source",
        xlab = "Data Source",
        ylab = "Prevalence per 1,000 Children",
        sub  = "Year 2000 - 2016",
        col.main="blue", col.lab="black", col.sub="darkgrey")

<h3>
    Quiz:
</h3>
<p>
    Set notch = TRUE in the boxplot above. Do the notches of the four data sources overlap?
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.

<h3>
Data Visualisation (Base Graphic) - Bar plot
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# ----------------------------------
# [National] Risk by Data Source
# ----------------------------------
# Create bar chart using R graphics
counts = table(ASD_National$Prevalence_Risk2, ASD_National$Source)
#counts = table(ASD_National$Source, ASD_National$Prevalence_Risk4)
barplot(counts,
        main="Prevalence by Data Sources and Risk Levels",
        xlab="Data Sources", col=c("white", "lightgrey"),
        ylab="Occurrences",
        legend = rownames(counts), 
        args.legend = list(x="topleft", bty = "n", cex = 0.85, y.intersp=2))

# ----------------------------------
# [National] Risk by Data Source
# ----------------------------------
# Create bar chart using R graphics
counts = table(ASD_National$Prevalence_Risk2, ASD_National$Source) # Count of Risk records, split by Source
barplot(counts,
        main="Prevalence by Data Sources and Risk Levels",
        xlab="Data Sources",
        ylab="Occurrences",
        col=c("white", "lightgrey"),
        legend = rownames(counts), 
        args.legend = list(x = "topleft", bty = "n", cex = 0.85, y.intersp = 2))

# ----------------------------------
# [National] Risk by Data Source
# ----------------------------------
# Create bar chart using R graphics
counts = table(ASD_National$Prevalence_Risk4, ASD_National$Source) # Count of Risk records, split by Source
barplot(counts,
        main="Prevalence Occurrence by Source and Risk",
        xlab="Data Sources",
        ylab="Occurrences",
        col=c("lightyellow", "orange", "red","darkred"),
        legend = rownames(counts), 
        args.legend = list(x = "topleft", bty = "n", cex = 0.85, y.intersp = 2))
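
Stacked bars make it hard to compare the upper segments. As an alternative, the sketch below draws grouped bars with `beside = TRUE` and uses the bar midpoints returned by barplot() to label each bar. The counts are re-typed from the Prevalence_Risk2 by Source table shown earlier; the `ylim` value is an assumption chosen to leave room for the labels.

```r
# Counts copied from the Prevalence_Risk2 x Source table printed earlier
counts <- matrix(c(0, 8, 7, 6, 0, 4, 7, 10), nrow = 2,
                 dimnames = list(c("Low", "High"),
                                 c("addm", "medi", "nsch", "sped")))

# beside = TRUE draws one bar per cell instead of stacking them;
# the return value holds the x position (midpoint) of every bar
mids <- barplot(counts, beside = TRUE,
                col = c("white", "lightgrey"),
                ylim = c(0, 12),
                main = "Prevalence by Data Sources and Risk Levels",
                xlab = "Data Sources", ylab = "Occurrences",
                legend.text = rownames(counts),
                args.legend = list(x = "topleft", bty = "n", cex = 0.85))

# Use the midpoints to print each count just above its bar
text(x = mids, y = counts, labels = counts, pos = 3, cex = 0.8)
```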

<h3>
Data Visualisation (Base Graphic) - Line chart
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=5)
# ----------------------------------
# [National] < Prevalence has changed over Time >
# ----------------------------------
# Prevalence over Year
# Use Year (numeric) as x-axis: scatter plot with one point per record, so Prevalence is NOT aggregated across data sources
plot(ASD_National$Year, ASD_National$Prevalence) 

# Use Year_Factor (factor) as x-axis: one box plot per year, so Prevalence IS summarised across data sources
plot(ASD_National$Year_Factor, ASD_National$Prevalence) 
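
The difference between the two calls comes from the type of the x variable: plot() draws a scatter plot for numeric x and a box plot per level for factor x. The sketch below shows this on a made-up data frame (`toy` and its values are hypothetical) and adds an explicit aggregation with `aggregate()`.

```r
# Hypothetical toy data: two records per year (illustrative values only)
toy <- data.frame(
  Year       = c(2000, 2000, 2002, 2002, 2004, 2004),
  Prevalence = c(6, 2, 7, 3, 8, 4)
)
toy$Year_Factor <- factor(toy$Year, ordered = TRUE)

# Numeric x: scatter plot, one point per row
plot(toy$Year, toy$Prevalence)

# Factor x: plot() draws one box per level
plot(toy$Year_Factor, toy$Prevalence)

# Explicit aggregation: mean Prevalence per Year
yearly_mean <- aggregate(Prevalence ~ Year, data = toy, FUN = mean)
print(yearly_mean)
```

Averaging across sources mixes different measurement methods, so the aggregated values are best read as a rough overview only.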

# table(ASD_National$Source_Full3)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)

par(mfrow=c(2, 2))

# Prevalence over Year, from data source: 
# addm-Autism & Developmental Disabilities Monitoring Network
plot(ASD_National$Year[ASD_National$Source == 'addm'], 
     ASD_National$Prevalence[ASD_National$Source == 'addm'])

# Prevalence over Year, from data source: 
# medi-Medicaid
plot(ASD_National$Year[ASD_National$Source == 'medi'], 
     ASD_National$Prevalence[ASD_National$Source == 'medi'])

# Prevalence over Year, from data source: 
# nsch-National Survey of Children's Health
plot(ASD_National$Year[ASD_National$Source == 'nsch'], 
     ASD_National$Prevalence[ASD_National$Source == 'nsch'])

# Prevalence over Year, from data source: 
# sped-Special Education Child Count
plot(ASD_National$Year[ASD_National$Source == 'sped'], 
     ASD_National$Prevalence[ASD_National$Source == 'sped'])

par(mfrow=c(1, 1)) # Reset to one plot on one page
# ----------------------------------
# Add more annotations to above plots
# ----------------------------------
# Color list
# addm : darkblue
# medi : orange
# nsch : darkred
# sped : skyblue

par(mfrow=c(2, 2))

# Prevalence over Year, from data source: 
# addm-Autism & Developmental Disabilities Monitoring Network
plot(ASD_National$Year[ASD_National$Source == 'addm'], 
     ASD_National$Prevalence[ASD_National$Source == 'addm'],
     type="l", # plot type: "l" draws lines only
     lty=1, # line type
     lwd=3, # line width
     col="darkblue", # line color
     xlab="Year", 
     ylab="Prevalence per 1,000 Children", 
     ylim = c(0, 30), # Set value range of y axis
     main="[addm] Prevalence Estimates Over Time",
     sub  = "zhan.gu@nus.edu.sg",
     col.main="blue", col.lab="black", col.sub="darkgrey")

# Prevalence over Year, from data source: 
# medi-Medicaid
plot(ASD_National$Year[ASD_National$Source == 'medi'], 
     ASD_National$Prevalence[ASD_National$Source == 'medi'],
     type="b", lty=1, lwd=3,  col="orange",
     xlab="Year", 
     ylab="Prevalence per 1,000 Children", 
     ylim = c(0, 30), # Set value range of y axis
     main="[medi] Prevalence Estimates Over Time",
     sub  = "zhan.gu@nus.edu.sg",
     col.main="blue", col.lab="black", col.sub="darkgrey")

# Prevalence over Year, from data source: 
# nsch-National Survey of Children's Health
plot(ASD_National$Year[ASD_National$Source == 'nsch'], 
     ASD_National$Prevalence[ASD_National$Source == 'nsch'],
     type="l", lty=2, lwd=3,  col="darkred",
     xlab="Year", 
     ylab="Prevalence per 1,000 Children", 
     ylim = c(0, 30), # Set value range of y axis
     main="[nsch] Prevalence Estimates Over Time",
     sub  = "zhan.gu@nus.edu.sg",
     col.main="blue", col.lab="black", col.sub="darkgrey")

# Prevalence over Year, from data source: 
# sped-Special Education Child Count
plot(ASD_National$Year[ASD_National$Source == 'sped'], 
     ASD_National$Prevalence[ASD_National$Source == 'sped'],
     type="l", lty=3, lwd=3,  col="skyblue",
     xlab="Year", 
     ylab="Prevalence per 1,000 Children", 
     ylim = c(0, 30), # Set value range of y axis
     main="[sped] Prevalence Estimates Over Time",
     sub  = "zhan.gu@nus.edu.sg",
     col.main="blue", col.lab="black", col.sub="darkgrey")

par(mfrow=c(1, 1)) # Reset to one plot on one page
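
The four plot() calls above differ only in the source name, colour and line type, so they can be generated in a loop. The following is a sketch under stated assumptions: `toy` is a hypothetical stand-in for ASD_National, and `source_col` and `source_lty` are lookup vectors introduced here for illustration.

```r
# Hypothetical toy data standing in for ASD_National (illustrative values only)
toy <- data.frame(
  Source     = rep(c("addm", "medi", "nsch", "sped"), each = 3),
  Year       = rep(c(2004, 2008, 2012), times = 4),
  Prevalence = c(8, 11, 15,  4, 6, 8,  6, 11, 20,  3, 5, 7)
)

# One colour and one line type per source, looked up by name
source_col <- c(addm = "darkblue", medi = "orange", nsch = "darkred", sped = "skyblue")
source_lty <- c(addm = 1, medi = 1, nsch = 2, sped = 3)

old_par <- par(mfrow = c(2, 2))        # 2 x 2 grid; remember the old layout
for (src in names(source_col)) {
  one <- toy[toy$Source == src, ]      # rows for this source only
  plot(one$Year, one$Prevalence,
       type = "l", lty = source_lty[[src]], lwd = 3, col = source_col[[src]],
       xlab = "Year", ylab = "Prevalence per 1,000 Children",
       ylim = c(0, 30),                # same y range in every panel
       main = paste0("[", src, "] Prevalence Estimates Over Time"))
}
par(old_par)                           # restore the previous layout
```

Keeping the same `ylim` in every panel makes the panels directly comparable.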
<h3>
Data Visualisation (Base Graphic) - <span style="color:blue">[ R ] REPORTED PREVALENCE HAS CHANGED OVER TIME</span> by [ Data Source ]
</h3>

Create multiple lines within a single chart

# ----------------------------------
# [National] < Prevalence Varies over Time/Year by Data Source >
# ----------------------------------
# Create a first line
plot(ASD_National$Year[ASD_National$Source == 'addm'], 
     ASD_National$Prevalence[ASD_National$Source == 'addm'], 
     col = "darkblue", lty = 1, lwd = 2,
     type = "b", # plot type: "b" draws both points and lines
     pch = 0, # dot/point type: http://www.endmemo.com/program/R/pchsymbols.php
     xlab="Year", 
     xlim=c(2000, 2016), # Set x axis value range
     ylab="Prevalence per 1,000 Children", 
     ylim=c(0, 30), # Set y axis value range
     main="Prevalence Estimates Over Time by Data Source",
     col.main="black", col.lab="black", col.sub="grey",
     frame = FALSE, # Remove frame
     axes=FALSE # Remove x and y axis
)
axis(1, at=seq(2000, 2016, 1)) # Customize x axis
axis(2, at=seq(0, 30, 5)) # Customize y axis

# Add another line
lines(ASD_National$Year[ASD_National$Source == 'medi'], 
      ASD_National$Prevalence[ASD_National$Source == 'medi'], 
      pch = 1, col = "orange", type = "b", lty = 1, lwd = 2
)
# Add another line
lines(ASD_National$Year[ASD_National$Source == 'nsch'], 
      ASD_National$Prevalence[ASD_National$Source == 'nsch'], 
      pch = 2, col = "darkred", type = "b", lty = 1, lwd = 2
)
# Add another line
lines(ASD_National$Year[ASD_National$Source == 'sped'], 
      ASD_National$Prevalence[ASD_National$Source == 'sped'], 
      pch = 5, col = "skyblue", type = "b", lty = 1, lwd = 2
)
# Add a legend to the plot
legend("topleft", legend=levels(ASD_National$Source),
       col=c("darkblue", "orange", "darkred", "skyblue"), 
       pch = c(0, 1, 2, 5), # point symbols matching the four lines above
       lty = 1, # line type
       lwd = 2, # line width
       cex=0.8, # size of text
       bty = 'n' # Without frame
)
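
An alternative to one plot() call followed by several lines() calls is to reshape the data to one column per source and hand the matrix to `matplot()`. This is a sketch on a made-up data frame (`toy` and its values are hypothetical); `wide`, `cols` and `pchs` are names chosen here for illustration.

```r
# Hypothetical toy data; nsch has no 2000 record, so the wide table gets an NA there
toy <- data.frame(
  Source     = c("addm", "medi", "sped", "addm", "medi", "nsch", "sped"),
  Year       = c(2000, 2000, 2000, 2004, 2004, 2004, 2004),
  Prevalence = c(7, 2, 1, 8, 3, 6, 2)
)

# Reshape to one row per Year and one column per Source
wide <- tapply(toy$Prevalence, list(toy$Year, toy$Source), mean)
print(wide)

cols <- c("darkblue", "orange", "darkred", "skyblue")
pchs <- c(0, 1, 2, 5)

# matplot() draws one series per column of the matrix
matplot(as.numeric(rownames(wide)), wide,
        type = "b", lty = 1, lwd = 2, col = cols, pch = pchs,
        xlab = "Year", ylab = "Prevalence per 1,000 Children", ylim = c(0, 30),
        main = "Prevalence Estimates Over Time by Data Source")

# Legend built from the same vectors, so symbols and colours match the lines
legend("topleft", legend = colnames(wide),
       col = cols, pch = pchs, lty = 1, lwd = 2, cex = 0.8, bty = "n")
```

One design consequence: years without a record become NA in the wide matrix, so the line is interrupted there instead of being drawn straight across the gap.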

R pch: dot/point type: http://www.endmemo.com/program/R/pchsymbols.php

R plot colour list: https://www.r-graph-gallery.com/42-colors-names.html

<h3>
Data Visualisation (Base Graphic) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY SEX</span> [ Source: ADDM ] over [ Year ]
</h3>
# ----------------------------------
# [addm] < Prevalence Varies by Sex >
# ----------------------------------
# Create a first line
plot(ASD_National$Year[ASD_National$Source == 'addm'], 
     ASD_National$Prevalence[ASD_National$Source == 'addm'], 
     col = "grey", lty = 1, lwd = 2,
     type = "l", # plot type: "l" draws lines only (pch is ignored)
     pch = 0, # dot/point type: http://www.endmemo.com/program/R/pchsymbols.php
     xlab="Year", 
     xlim=c(2000, 2016), # Set x axis value range
     ylab="Prevalence per 1,000 Children", 
     ylim=c(0, 30), # Set y axis value range
     main="Prevalence Estimates by Sex [ADDM]",
     col.main="black", col.lab="black", col.sub="grey",
     frame = FALSE, # Remove frame
     axes=FALSE # Remove x and y axis
)
axis(1, at=seq(2000, 2016, 1)) # Customize x axis
axis(2, at=seq(0, 30, 5)) # Customize y axis

# Add Female prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Female.Prevalence[ASD_National$Source == 'addm'], 
      pch = 1, col = "orange", type = "l", lty = 1, lwd = 2)
# Add Female prevalence lower CI
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Female.Lower.CI[ASD_National$Source == 'addm'], 
      pch = 1, col = "orange", type = "l", lty = 3, lwd = 1)
# Add Female prevalence upper CI
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Female.Upper.CI[ASD_National$Source == 'addm'], 
      pch = 1, col = "orange", type = "l", lty = 3, lwd = 1)

# Add Male prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Male.Prevalence[ASD_National$Source == 'addm'], 
      pch = 1, col = "blue", type = "l", lty = 1, lwd = 2)
# Add Male prevalence lower CI
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Male.Lower.CI[ASD_National$Source == 'addm'], 
      pch = 1, col = "blue", type = "l", lty = 3, lwd = 1)
# Add Male prevalence upper CI
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Male.Upper.CI[ASD_National$Source == 'addm'], 
      pch = 1, col = "blue", type = "l", lty = 3, lwd = 1)
# Add a legend to the plot
legend("topleft", legend=c('ADDM Average', 'Female with 95% CI', 'Male with 95% CI'),
       col=c("grey", "orange", "blue"), 
       #       pch = 20, # dot in a line
       lty = 1, # line type
       lwd = 2, # line width
       cex=0.8, # size of text
       bty = 'n' # Without frame
)
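
Confidence limits can also be shown as a shaded band instead of dotted lines. The sketch below uses `polygon()` on made-up numbers (`year`, `est`, `lower` and `upper` are hypothetical, not the ADDM estimates). Rows with missing limits would need to be dropped first, for example with complete.cases().

```r
# Hypothetical toy values (illustrative only, not the ADDM estimates)
year  <- c(2000, 2004, 2008, 2012)
est   <- c(11, 13, 18, 23)
lower <- c(10, 12, 17, 22)
upper <- c(12, 14, 19, 24)

# Empty plot that only sets up the axes
plot(year, est, type = "n", ylim = c(0, 30),
     xlab = "Year", ylab = "Prevalence per 1,000 Children")

# Polygon outline: lower limit left to right, then upper limit right to left
band_x <- c(year, rev(year))
band_y <- c(lower, rev(upper))
polygon(band_x, band_y, col = adjustcolor("blue", alpha.f = 0.2), border = NA)

# Draw the estimate on top of the band
lines(year, est, col = "blue", lwd = 2)
```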

<h3>
Data Visualisation (Base Graphic) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY RACE AND ETHNICITY</span> [ Source: ADDM ]
</h3>
# ----------------------------------
# [addm] < Prevalence Varies by Race and Ethnicity >
# ----------------------------------
# Create a first line
plot(ASD_National$Year[ASD_National$Source == 'addm'], 
     ASD_National$Prevalence[ASD_National$Source == 'addm'], 
     col = "grey", lty = 1, lwd = 2,
     type = "l", # plot type: "l" draws lines only (pch is ignored)
     pch = 0, # dot/point type: http://www.endmemo.com/program/R/pchsymbols.php
     xlab="Year", 
     xlim=c(2000, 2016), # Set x axis value range
     ylab="Prevalence per 1,000 Children", 
     ylim=c(0, 30), # Set y axis value range
     main="Prevalence Estimates by Race/Ethnicity [ADDM]",
     col.main="black", col.lab="black", col.sub="grey",
     frame = FALSE, # Remove frame
     axes=FALSE # Remove x and y axis
)
axis(1, at=seq(2000, 2016, 1)) # Customize x axis
axis(2, at=seq(0, 30, 5)) # Customize y axis

# R plot colour list: https://www.r-graph-gallery.com/42-colors-names.html

# Add Asian.or.Pacific.Islander.Prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Asian.or.Pacific.Islander.Prevalence[ASD_National$Source == 'addm'], 
      pch = 20, col = "darkred", type = "b", lty = 1, lwd = 2)
# Add Hispanic.Prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Hispanic.Prevalence[ASD_National$Source == 'addm'], 
      pch = 20, col = "darkorchid3", type = "b", lty = 1, lwd = 2)
# Add Non.hispanic.black.Prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Non.hispanic.black.Prevalence[ASD_National$Source == 'addm'], 
      pch = 20, col = "deepskyblue3", type = "b", lty = 1, lwd = 2)
# Add Non.hispanic.white.Prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'], 
      ASD_National$Non.hispanic.white.Prevalence[ASD_National$Source == 'addm'], 
      pch = 20, col = "chartreuse3", type = "b", lty = 1, lwd = 2)

# Add a legend to the plot
legend("topleft", legend=c('ADDM Average', 
                           'Non-Hispanic White',
                           'Non-Hispanic Black',
                           'Hispanic', 
                           'Asian/Pacific Islander'),
       col=c("grey", "chartreuse3", "deepskyblue3", "darkorchid3", "darkred"), 
       pch = 20, # dot in a line
       lty = 1, # line type
       lwd = 2, # line width
       cex=0.8, # size of text
       bty = 'n' # Without frame
)

# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
<h3>
    Quiz:
</h3>
<p>
    Add the 95% confidence intervals to the plot above.
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.

<h3>
    Quiz:
</h3>
<p>
    Use table() to count the number of prevalence records for each Data Source. Then use barplot() to visualize the counts.
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.

<h3>
    Quiz:
</h3>
<p>
    Which Data Sources are available in which years?
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.

<h3>
    Quiz:
</h3>
<p>
    Which Data Source breaks down Prevalence data by sex/gender?
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.

<h3>
    Quiz:
</h3>
<p>
    Which Data Source breaks down Prevalence data by race and ethnicity?
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.

Excellent! You have completed the workshop notebook!

Connect with the author:

This notebook was written by GU Zhan (Sam).

Sam is currently a lecturer at the Institute of Systems Science, National University of Singapore. He devotes himself to pedagogy & andragogy, and is passionate about inspiring the next generation of artificial intelligence lovers and leaders.

Copyright © 2020 GU Zhan

This notebook and its source code are released under the terms of the MIT License.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

<a href="">
</a>

Appendices

<h3>
Interactive workshops: < Learning R inside R > using swirl() (in R/RStudio)
</h3>

https://github.com/telescopeuser/S-SB-Workshop

<h3>
Neural Network 101 using nnet()
</h3>

Use a neural network to classify three species of iris flowers, based on four features/measurements:

  • length of the petals

  • width of the petals

  • length of the sepals

  • width of the sepals

# ----------------------------------
# Neural Network 101 using nnet()
# ----------------------------------
if(!require(nnet)){install.packages("nnet")}
## Loading required package: nnet
library("nnet")
# ?nnet
 
# < Case: predict three different iris flower types >

# https://en.wikipedia.org/wiki/Iris_flower_data_set
# https://archive.ics.uci.edu/ml/datasets/iris

# Data preparation: split iris data in two halves, for training & testing respectively.
ir <- rbind(iris3[,,1],iris3[,,2],iris3[,,3])
targets <- class.ind( c(rep("setosa", 50), rep("versicolor", 50), rep("virginica", 50)) )
samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))
# Model training (machine learning / data fitting)
ir1 <- nnet(ir[samp,], targets[samp,], size = 2, rang = 0.1,
            decay = 5e-4, maxit = 200)
## # weights:  19
## initial  value 56.322878 
## iter  10 value 26.168810
## iter  20 value 17.958623
## iter  30 value 0.578242
## iter  40 value 0.523268
## iter  50 value 0.494103
## iter  60 value 0.484389
## iter  70 value 0.481326
## iter  80 value 0.479536
## iter  90 value 0.477770
## iter 100 value 0.477145
## iter 110 value 0.476933
## iter 120 value 0.476866
## iter 130 value 0.476813
## iter 140 value 0.476791
## iter 150 value 0.476789
## iter 160 value 0.476789
## final  value 0.476788 
## converged
# Model evaluation function
test.cl <- function(true, pred) {
  true <- max.col(true)
  cres <- max.col(pred)
  table(true, cres)
}
# Model evaluation
test.cl(targets[-samp,], predict(ir1, ir[-samp,]))
##     cres
## true  1  2  3
##    1 25  0  0
##    2  0 22  3
##    3  0  0 25
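
The confusion matrix can be condensed into a few summary numbers. The sketch below re-types the matrix printed above and computes overall accuracy and per-class recall and precision; the names `cm`, `accuracy`, `recall` and `precision` are chosen here for illustration.

```r
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm <- matrix(c(25,  0,  0,
                0, 22,  3,
                0,  0, 25),
             nrow = 3, byrow = TRUE,
             dimnames = list(true = c("1", "2", "3"),
                             cres = c("1", "2", "3")))

# Overall accuracy: correctly classified (diagonal) over all test samples
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

# Recall per class: share of each true class that was predicted correctly
recall <- diag(cm) / rowSums(cm)
print(recall)

# Precision per class: share of each predicted class that was correct
precision <- diag(cm) / colSums(cm)
print(round(precision, 3))
```

Because `samp` is drawn with sample() and nnet() starts from random weights, rerunning the notebook gives a different split and a slightly different matrix unless set.seed() is called first.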